IRE 


Transactions
on INFORMATION THEORY 


Vol. IT-2, No. 3 September, 1956 


1956 SYMPOSIUM ON INFORMATION THEORY 
held at 


Massachusetts Institute of Technology 


Cambridge, Massachusetts 


September 10-12, 1956 




PUBLISHED BY THE 


Professional Group on Information Theory 


IRE PROFESSIONAL GROUP ON INFORMATION THEORY 


The Professional Group on Information Theory is an organization, within the
framework of the IRE, of members with principal professional interest in
Information Theory. All members of the IRE are eligible for membership in the
Group and will receive all Group publications upon payment of prescribed
assessments.
Annual Assessment: $2.00 
Administrative Committee 
Michael J. DiToro, Chairman
Wilbur B. Davenport, Jr., Vice-Chairman
Harold R. Holloway, Secretary-Treasurer
T. P. Cheatham      Robert M. Fano       Nathan Marchand
Harry Davis         Laurin G. Fischer    Winslow Palmer
Louis A. DeRosa     M. J. E. Golay       F. L. H. M. Stumpers
Donald B. Duncan    Ernest R. Kretzmer   Warren D. White


IRE TRANSACTIONS


on Information Theory 


Published by the Institute of Radio Engineers, Inc., for the Professional Group on 
Information Theory, 1 East 79th Street, New York 21, N. Y. Responsibility for the 
contents rests upon the authors, and not upon the IRE, the Group or its members. 
Individual copies available for sale to IRE-PGIT members at $3.00, to IRE members 
at $4.50 and to nonmembers at $9.00. 


©1956 - THE INSTITUTE OF RADIO ENGINEERS, INC.
All rights, including translation, are reserved by the IRE. Requests for republication
privileges should be addressed to the Institute of Radio Engineers, 1 E. 79th St.,
New York 21, N. Y.


TRANSACTIONS 
of the 


1956 SYMPOSIUM ON INFORMATION THEORY 
held at 
Massachusetts Institute of Technology, Cambridge, Massachusetts 


September 10-12, 1956 


Organized by 
The Professional Group on Information Theory, Institute of Radio Engineers 


In cooperation with 
The Research Laboratory of Electronics, Massachusetts Institute of Technology 


and sponsored by 


The International Scientific Radio Union (URSI) 
The Office of Naval Research 
The Signal Corps Engineering Laboratories 
The Air Research and Development Command 


Organizing Committee 
P. Elias, Chairman 


T. P. Cheatham Y. W. Lee O. G. Selfridge 
R. M. Fano W. A. Rosenblith C. E. Shannon 
P. E. Green, Jr. R. A. Sayers J. B. Wiesner 


These papers are to be presented at the 1956 Symposium
on Information Theory. They are published prior to
the Symposium to allow informed discussion at the
meeting.


CONTENTS AND ABSTRACTS 


CODING I


"The Zero Error Capacity of a Noisy Channel," by C. E. Shannon


The zero error capacity $C_0$ of a noisy channel is defined as the least
upper bound of rates at which it is possible to transmit information with
zero probability of error. Various properties of $C_0$ are studied; upper
and lower bounds and methods of evaluation of $C_0$ are given. Inequalities
are obtained for the $C_0$ relating to the "sum" and "product" of two given
channels. The analogous problem of zero error capacity $C_{0F}$ for a channel
with a feedback link is considered. It is shown that while the ordinary
capacity of a memoryless channel with feedback is equal to that of the
same channel without feedback, the zero error capacity may be greater. A
solution is given to the problem of evaluating $C_{0F}$.


"A Linear Circuit Viewpoint on Error-Correcting Codes," by D. A. Huffman


A linear binary filter has as its output a binary sequence, each digit 
of which is the result of a parity check on a selection of preceding 
output digits and of present and preceding digits of the filter input 
sequence. The terminal properties of these filters may be described by 
transfer ratios of polynomials in a delay operator. If two binary fil- 
ters have transfer ratios which are reciprocally related then the filters 
are mutually inverse in the sense that, in a cascade connection, the 
second filter unscrambles the scrambling produced by the first. The cod- 
ing of a finite sequence of binary information digits for protection 
against noise may be accomplished by a binary sequence filter, the output 
of which becomes the sequence to be transmitted. The inverse filter is 
utilized at the receiver. 


CODING II
"Theory of Information Feedback Systems," by S. S. L. Chang


A general information feedback system is defined and formulated in a
way broad enough to allow coded or uncoded channels with total or partial
information feedback. Basic theorems governing change in information
rate and reliability are derived with full consideration of the transition
probabilities of both direct and feedback channels, including message
words as well as the confirmation-denial signal.


"A Linear Coding for Transmitting a Set of Correlated Signals," by
H. P. Kramer and M. V. Mathews


A coding scheme is described for the transmission of n continuous
correlated signals over m channels, m being equal to or less than n.
Each of the m signals is a linear combination of the n original signals.


"On an Application of Semi-Group Methods to Some Problems in Coding," by
M. P. Schutzenberger


We give an abstract model of some sort of language and try to show
how semi-group concepts apply fruitfully to it with the hope that some of
them may be of interest to specialists working on natural languages. In
a first part, the model and its main properties are discussed at a concrete
level on the simplest cases: coding and decoding with length-bounded
codes. In a second part a selection of theorems are proved
whenever the necessary semi-group-theoretic prerequisites are not
exacting.


AUTOMATA
"The Logic Theory Machine," by A. Newell and H. A. Simon


In this paper we describe a complex information processing system,
which we call the logic theory machine, that is capable of discovering
proofs for theorems in symbolic logic. This system relies heavily on
heuristic methods similar to those that have been observed in human
problem solving activity. The present paper is concerned with specification
of the system, and not with its realization in a computer.


"Tests on a Cell Assembly Theory of the Action of the Brain, Using a Large
Digital Computer," by N. Rochester, J. H. Holland, L. H. Haibt and
W. L. Duda


Theories by D. O. Hebb and P. M. Milner on how the brain works were
tested by simulating neuron nets on the IBM Type 704 Electronic Calculator.
The cell assemblies do not yet act just as the theory requires, but
changes in the theory and the simulation offer promise for further
experimentation.


INFORMATION SOURCES


"The Measurement of Third Order Probability Distributions of Television
Signals," by W. F. Schreiber


A device has been built for the rapid, automatic measurement of the
third order probability density of video signals. Examples are presented
of second and third order distributions, and of entropies calculated for
a variety of scenes.


"Gap Analysis and Syntax," by V. H. Yngve


A statistical procedure has been tried as a method of investigating 
the structure of language with the aid of data processing machines. The 
frequency of gaps of various lengths between occurrences of two specified 
words is counted. The results are compared with what would be expected 
if the occurrences of the two words were statistically independent. 
Deviations from the expected number give clues to the constraints that 
operate between words in a language.


"Three Models for the Description of Language," by A. N. Chomsky


We investigate several conceptions of linguistic structure to determine
whether or not they can provide simple and "revealing" grammars that
generate all of the sentences of English and only these. We find that no
finite-state Markov process that produces symbols with transition from
state to state can serve as an English grammar. We formalize the notion
of "phrase structure" and show that this gives us a method for describing 
language which is essentially more powerful. We study the properties of 
a set of grammatical transformations, showing that the grammar of English 
is materially simplified if phrase-structure description is limited to a 
kernel of simple sentences from which all other sentences are constructed 
by repeated transformations, and that this view of linguistic structure 
gives a certain insight into the use and understanding of language. 


INFORMATION USERS
"Some Studies in the Speed of Visual Perception," by G. C. Sziklai


Statistical studies of television signals indicated a high degree of 
correlation between successive elements, lines and frames. Some tests 
were devised to measure the perception speed of observers. These tests 
included certain reading and character recognition tests and finally a 
test consisting of object recognition in precisely measured periods was 
devised. Several series of these tests indicated that the visual percep- 
tion speed of a normal observer is between 30 and 50 bits per second, 
that this value holds for periods of one-tenth to two seconds, and that 
the first thing observed is the center of the picture.


"Human Memory and the Storage of Information," by G. A. Miller


The amount of selective information in a message can be increased 
either by increasing the variety of the symbols from which it is composed 
or by increasing the length of the message. The variety of the symbols 
is far less important than the length of the message in controlling what 
human subjects are able to remember.


"The Human Use of Information III. Decision-Making in Signal Detection
and Recognition Situations Involving Multiple Alternatives," by
J. A. Swets and T. G. Birdsall


A general theory of signal detectability, constructed after the model 
provided by decision theory, is applied to the performance of the human 
observer faced with the problem of choosing among multiple signal alter- 
natives on the basis of a fixed, finite observation interval. The 
results indicate that a highly simplified theory is adequate for prediction
of the obtained payoff and response-frequency tables to within a
few per cent. They also indicate the fairly large extent to which 
intelligence may influence a sensory process usually assumed to involve 
fixed parameters. 


OPTIMUM MEAN-SQUARE OPERATIONS


"On Optimum Non-Linear Extraction and Coding Filters," by A. V.
Balakrishnan and R. F. Drenick


The problem of determining optimal non-linear least-square filters is 
solved for a class of stationary time series. This theory is then used 
as the basis for developing a band-width reduction scheme using non- 
linear encoding and decoding filters, for the same class of signals. A 
simple illustrative example is included.


"Final-Value Systems with Gaussian Inputs," by R. C. Booton, Jr.


A final-value system controls a response variable r(t) over a time
interval (0,T) with the objective of minimizing the difference between a
desired value f and the final response value r(T). Physical limitations
of the element being controlled result in a maximum-value constraint on
the system velocity r'(t). Earlier results suggest that a system
consisting of an estimator followed by a "bang-bang" servo is approximately
optimum. The estimator uses the input to produce an estimate f*
of the desired response and the servo results in a system velocity as
large in magnitude as possible and with the same sign as the difference
f* - r. The present paper shows that this system is the true optimum
when the joint distribution of the input and the desired response is
Gaussian and the error criterion is minimization of the average of a
nondecreasing function of the magnitude of the error.


"An Extension of the Minimum Mean Square Prediction Error Theory for
Sampled Data," by M. Blum


A method is developed for finding the ordinates of a digital filter 
which will produce a general linear operator of the signal S(t) such 
that the mean square error of prediction will be a minimum. The input 
to the filter is sampled at intervals t. The samples contain stationary 
noise N(j t), a stationary signal component, M(j t), and a nonrandom 
signal component. The solution is obtained as a matrix equation which 
relates the ordinates of the digital filter to the autocorrelation 
properties of M(t) and N(t) and the nature of the prediction operation. 


APPLICATIONS
"A New Interpretation of Information Rate," by J. L. Kelly


If the input symbols to a communication channel represent the outcomes
of a chance event on which bets are available at odds consistent with
their probabilities (i.e., "fair" odds), a gambler can use the knowledge
given him by the received symbols to cause his money to grow exponentially.
The maximum exponential rate of growth of the gambler's capital
is equal to the rate of transmission of information over the channel.
Thus we find a situation in which the transmission rate has significance
even though no coding is contemplated.


"An Outline of a Purely Phenomenological Theory of Statistical
Thermodynamics: I. Canonical Ensembles," by B. Mandelbrot


Since the kinetic foundations of thermodynamics are not sufficient 
in the absence of further hypotheses of randomness, are they necessary 
in the presence of such hypotheses? The aim of the paper is to show 
(partly after Szilard) that a substantial part of the results, usually 
obtained through kinetic arguments, could be obtained by postulating 
from the outset a statistical distribution for the properties of a 
system, and following up with a purely phenomenological argument. It is 
of interest to the communication engineer to have a unified treatment of 
the foundations of fluctuation phenomena and of methods of fighting 
noise.


"A Radar Detection Philosophy," by W. McC. Siebert


This paper attempts to present a short, unified discussion of the
radar detection, parameter estimations, and multiple-signal resolution
problems, mostly from a philosophical rather than a detailed mathematical
point of view. The purpose is to make it possible in at least some
limited sense to reason back from appropriate measures of desired radar
performance to specifications of the necessary values of the related
radar parameters.


THE ZERO ERROR CAPACITY OF A NOISY CHANNEL 


Claude E. Shannon
Bell Telephone Laboratories, Murray Hill, New Jersey 
Massachusetts Institute of Technology, Cambridge, Mass. 


Abstract 


The zero error capacity $C_0$ of a noisy
channel is defined as the least upper bound of
rates at which it is possible to transmit information
with zero probability of error. Various
properties of $C_0$ are studied; upper and lower
bounds and methods of evaluation of $C_0$ are given.
Inequalities are obtained for the $C_0$ relating to
the "sum" and "product" of two given channels.
The analogous problem of zero error capacity $C_{0F}$
for a channel with a feedback link is considered.
It is shown that while the ordinary capacity of
a memoryless channel with feedback is equal to
that of the same channel without feedback, the
zero error capacity may be greater. A solution
is given to the problem of evaluating $C_{0F}$.


Introduction 


The ordinary capacity C of a noisy channel 
may be thought of as follows. There exists a 
sequence of codes for the channel of increasing 
block length such that the input rate of trans- 
mission approaches C and the probability of error 
in decoding at the receiving point approaches 
zero. Furthermore, this is not true for any 
value higher than C. In some situations it may 
be of interest to consider, rather than codes 
with probability of error approaching zero, codes 
for which the probability is zero and to 
investigate the highest possible rate of trans- 
mission (or the least upper bound of these rates) 
for such codes. This rate, $C_0$, is the main
object of investigation of the present paper.
It is interesting that while $C_0$ would appear to
be a simpler property of a channel than C, it is 
in fact more difficult to calculate and leads to 
a number of as yet unsolved problems. 


We shall consider only finite discrete
memoryless channels. Such a channel is specified
by a finite transition matrix $\|p_i(j)\|$ where
$p_i(j)$ is the probability of input letter i being
received as output letter j (i = 1,2,...,a;
j = 1,2,...,b) and $\sum_j p_i(j) = 1$. Equivalently,
such a channel may be represented by a line
diagram such as Fig. 1.


Fig. 1


The channel being memoryless means that
successive operations are independent. If the
input letters i and j are used, the probability
of output letters k and l will be $p_i(k)\, p_j(l)$.
A sequence of input letters will be called an
input word, a sequence of output letters an
output word. A mapping of M messages (which we
may take to be the integers 1,2,...,M) into a
subset of input words of length n will be called
a block code of length n. $R = \frac{1}{n} \log M$ will be
called the input rate for this code. Unless
otherwise specified, a code will mean such a
block code. We will, throughout, use natural
logarithms and natural (rather than binary) units
of information, since this simplifies the
analytical processes that will be employed.


A decoding system for a block code of
length n is a method of associating a unique
input message (integer from 1 to M) with each
possible output word of length n, that is, a
function from output words of length n to the
integers 1 to M. The probability of error for a
code is the probability, when the M input messages
are used each with probability 1/M, that the noise
and the decoding system will lead to an input
message different from the one that actually
occurred.


If we have two given channels, it is
possible to form a single channel from them in
two natural ways which we call the sum and
product of the two channels. The sum of two
channels is the channel formed by using inputs
from either of the two given channels with the
same transition probabilities to the set of output
letters consisting of the logical sum of the
two output alphabets. Thus the sum channel is
defined by a transition matrix formed by placing
the matrix of one channel below and to the right
of that for the other channel and filling the
remaining two rectangles with zeros. If $\|p_i(j)\|$
and $\|p'_i(j)\|$ are the individual matrices, the
sum has the following matrix:

$$
\begin{pmatrix}
p_1(1) & \cdots & p_1(r) & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
p_t(1) & \cdots & p_t(r) & 0 & \cdots & 0 \\
0 & \cdots & 0 & p'_1(1) & \cdots & p'_1(r') \\
\vdots & & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & p'_{t'}(1) & \cdots & p'_{t'}(r')
\end{pmatrix}
$$


The product of two channels is the channel
whose input alphabet consists of all ordered
pairs (i,i') where i is a letter from the first
channel alphabet and i' from the second, whose
output alphabet is the similar set of ordered
pairs of letters from the two individual output
alphabets and whose transition probability from
$(i,i')$ to $(j,j')$ is $p_i(j)\, p'_{i'}(j')$.
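The two constructions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: channels are represented as lists of rows of transition probabilities, the function names `sum_channel` and `product_channel` are hypothetical, and the two example channels are invented for the purpose.

```python
def sum_channel(p, q):
    """Transition matrix of the sum of two channels (block-diagonal layout)."""
    cols_p, cols_q = len(p[0]), len(q[0])
    top = [list(row) + [0.0] * cols_q for row in p]
    bottom = [[0.0] * cols_p + list(row) for row in q]
    return top + bottom


def product_channel(p, q):
    """Transition matrix of the product channel.

    Rows are ordered by input pairs (i, i'), columns by output pairs (j, j');
    each entry is p[i][j] * q[i'][j'].
    """
    return [
        [p_ij * q_ij for p_ij in row_p for q_ij in row_q]
        for row_p in p
        for row_q in q
    ]


# Two hypothetical binary-input channels, chosen only for illustration.
p = [[0.9, 0.1], [0.2, 0.8]]
q = [[1.0, 0.0], [0.5, 0.5]]

s = sum_channel(p, q)       # 4 input letters, 4 output letters
t = product_channel(p, q)   # 4 input pairs, 4 output pairs
```

In both cases every row of the result is again a probability distribution, as a transition matrix requires.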


The sum of channels corresponds physically
to a situation where either of two channels may be
used (but not both), a new choice being made for
each transmitted letter. The product channel
corresponds to a situation where both channels
are used each unit of time. It is interesting to
note that multiplication and addition of channels
are both associative and commutative, and that
the product distributes over a sum. Thus one can
develop a kind of algebra for channels in which
it is possible to write, for example, a polynomial
$\sum a_n K^n$ where the $a_n$ are non-negative integers
and K is a channel. We shall not, however,
investigate here the algebraic properties of this
system.


The Zero Error Capacity 


In a discrete channel we will say that two
input letters are adjacent if there is an output
letter which can be caused by either of these two.
Thus, i and j are adjacent if there exists a t
such that both $p_i(t)$ and $p_j(t)$ do not vanish. In
Fig. 1, a and c are adjacent, while a and d are
not.


If all input letters are adjacent to each
other, any code with more than one word has a
probability of error at the receiving point
greater than zero. In fact, the probability of
error $P_e$ in decoding words satisfies

$$P_e \ge \frac{M-1}{M}\, p_{\min}^{\,n}$$

where $p_{\min}$ is the smallest (non-vanishing) among
the $p_i(j)$, n is the length of the code and M is
the number of words in the code. To prove this,
note that any two words have a possible output
word in common, namely the word consisting of the
sequence of common output letters when the two input
words are compared letter by letter. Each of
the two input words has a probability at least
$p_{\min}^{\,n}$ of producing this common output word. In
using the code, the two particular input words
will each occur $\frac{1}{M}$ of the time and will cause the
common output $\frac{1}{M} p_{\min}^{\,n}$ of the time. This output
can be decoded in only one way. Hence at least
one of these situations leads to an error. This
error, $\frac{1}{M} p_{\min}^{\,n}$, is assigned to this code word, and
from the remaining M - 1 code words another pair
is chosen. A source of error to the amount
$\frac{1}{M} p_{\min}^{\,n}$ is assigned in similar fashion to one of
these, and this is a disjoint event. Continuing
in this manner, we obtain a total of at least
$\frac{M-1}{M} p_{\min}^{\,n}$ as probability of error.
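As a small numerical illustration of this bound, the sketch below evaluates $\frac{M-1}{M} p_{\min}^{\,n}$ for a hypothetical binary symmetric channel. The function name `error_lower_bound` and the channel values are assumptions made for the example; the bound applies only because every pair of input letters of this channel is adjacent.

```python
def error_lower_bound(p, n, M):
    """Evaluate (M - 1) / M * p_min ** n for a transition matrix p.

    p_min is the smallest non-vanishing transition probability. The bound
    is meaningful only when all input letters are adjacent to each other.
    """
    p_min = min(x for row in p for x in row if x > 0)
    return (M - 1) / M * p_min ** n


# Hypothetical binary symmetric channel with crossover probability 0.1.
# Both inputs can produce both outputs, so all letters are adjacent.
bsc = [[0.9, 0.1], [0.1, 0.9]]
bound = error_lower_bound(bsc, n=3, M=4)   # (3/4) * 0.1**3
```

The bound decreases geometrically in n but never reaches zero, which is the point of the argument: no code of more than one word is error free on such a channel.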

If it is not true that the input letters
are all adjacent to each other, it is possible to
transmit at a positive rate with zero probability
of error. The least upper bound of all rates
which can be achieved with zero probability of
error will be called the zero error capacity of
the channel and denoted by $C_0$. If we let $M_0(n)$ be
the largest number of words in a code of length n,
no two of which are adjacent, then $C_0$ is the least
upper bound of the numbers $\frac{1}{n} \log M_0(n)$ when n
varies through all positive integers.
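For very small alphabets and block lengths, $M_0(n)$ can be found by exhaustive search. The following sketch is not a method from the paper: it is a plain branch-on-each-word search, practical only for tiny cases, and the five-letter cyclic adjacency used as the example is a hypothetical channel chosen for illustration. The names `words_adjacent`, `max_code_size` and `cyclic_adj` are likewise assumptions.

```python
from itertools import product


def words_adjacent(u, v, adj):
    """Two words are adjacent if every letter position is equal or adjacent."""
    return all(a == b or adj(a, b) for a, b in zip(u, v))


def max_code_size(letters, adj, n):
    """Largest set of pairwise non-adjacent words of length n (exhaustive)."""
    words = list(product(letters, repeat=n))

    def search(candidates):
        if not candidates:
            return 0
        first, rest = candidates[0], candidates[1:]
        # Either use `first` (dropping every word adjacent to it) or skip it.
        keep = [w for w in rest if not words_adjacent(first, w, adj)]
        return max(1 + search(keep), search(rest))

    return search(words)


# Hypothetical five-letter channel: letter i is adjacent to i + 1 and i - 1
# modulo 5.
def cyclic_adj(a, b):
    return (a - b) % 5 in (1, 4)


m1 = max_code_size(range(5), cyclic_adj, 1)
m2 = max_code_size(range(5), cyclic_adj, 2)
```

Comparing `math.log(m1)` with `math.log(m2) / 2` then shows directly whether the rate improves when going from n = 1 to n = 2 for the chosen adjacency structure.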


One might expect that $C_0$ would be equal to
$\log M_0(1)$, that is, that if we choose the largest
possible set of non-adjacent letters and form all
sequences of these of length n, then this would be
the best error free code of length n. This is not,
in general, true, although it holds in many cases,
particularly when the number of input letters is
small. The first failure occurs with five input
letters with the channel in Fig. 2. In this
channel, it is possible to choose at most two non-adjacent
letters, for example 0 and 2. Using
sequences of these, 00, 02, 20, and 22 we obtain
four words in a code of length two. However, it
is possible to construct a code of length two with
five members no two of which are adjacent as
follows: 00, 12, 24, 31, 43. It is readily
verified that no two of these are adjacent. Thus,
$C_0$ for this channel is at least $\frac{1}{2} \log 5$.


Fig. 2
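The verification mentioned above can be carried out mechanically. The sketch below assumes, since the figure is not reproduced here, that the channel of Fig. 2 has five input letters 0 to 4 with letter i adjacent to i + 1 and i - 1 modulo 5; this is consistent with the statement that 0 and 2 are non-adjacent. The helper names are hypothetical.

```python
import math
from itertools import combinations


# Assumed adjacency for the channel of Fig. 2 (an assumption, see above):
# two letters can be confused if they are equal or differ by 1 modulo 5.
def letters_confusable(a, b):
    return (a - b) % 5 in (0, 1, 4)


def words_adjacent(u, v):
    """Words are adjacent if they can be confused in every letter position."""
    return all(letters_confusable(a, b) for a, b in zip(u, v))


# The five code words 00, 12, 24, 31, 43 given in the text.
code = [(0, 0), (1, 2), (2, 4), (3, 1), (4, 3)]
bad_pairs = [(u, v) for u, v in combinations(code, 2) if words_adjacent(u, v)]

rate_pairs = math.log(len(code)) / 2   # rate of the length-two code
rate_single = math.log(2)              # rate using letters 0 and 2 only
```

An empty `bad_pairs` list confirms that the code is error free under the assumed adjacency, and the two rates can be compared in natural units.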


No method has been found for determining
$C_0$ for the general discrete channel, and this we
propose as an interesting unsolved problem in
coding theory. We shall develop a number of
results which enable one to determine $C_0$ in many
special cases, for example, in all channels with
five or less input letters with the single exception
of the channel of Fig. 2 (or channels
equivalent in adjacency structure to it). We
will also develop some general inequalities
enabling one to estimate $C_0$ quite closely in most
cases.


It may be seen, in the first place, that
the value of $C_0$ depends only on which input
letters are adjacent to each other. Let us
define the adjacency matrix for a channel, $A_{ij}$,
as follows.

$$
A_{ij} =
\begin{cases}
1 & \text{if input letter } i \text{ is adjacent to } j \text{ or if } i = j \\
0 & \text{otherwise}
\end{cases}
$$


Suppose two channels have the same adjacency
matrix (possibly after renumbering the input
letters of one of them). Then it is obvious that
a zero error code for one will be a zero error
code for the other and, hence, that the zero
error capacity $C_0$ for one will also apply to the
other.
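The adjacency matrix is easily computed from a transition matrix. The sketch below is an illustration only; the function name `adjacency_matrix` and both example channels are hypothetical. The two channels have different transition probabilities, and even different numbers of output letters, yet the same adjacency matrix, so by the remark above they have the same zero error capacity.

```python
def adjacency_matrix(p):
    """A[i][k] = 1 if inputs i and k share a possible output, or if i == k."""
    n_in, n_out = len(p), len(p[0])
    return [
        [
            1
            if i == k or any(p[i][t] > 0 and p[k][t] > 0 for t in range(n_out))
            else 0
            for k in range(n_in)
        ]
        for i in range(n_in)
    ]


# Two hypothetical three-input channels with different probabilities.
ch1 = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
ch2 = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]

a1 = adjacency_matrix(ch1)
a2 = adjacency_matrix(ch2)
```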


The adjacency structure contained in the 
adjacency matrix can also be represented as a 
linear graph. Construct a graph with as many 
vertices as there are input letters, and connect 
two distinct vertices with a line or branch of 
the graph if the corresponding input letters are 
adjacent. Two examples are shown in Fig. 3,
corresponding to the channels of Figs. 1 and 2. 


Fig. 3


Theorem 1: The zero error capacity $C_0$ of
a discrete memoryless channel is bounded by the
inequalities

$$-\log \min_{P_i} \sum_{ij} A_{ij} P_i P_j \;\le\; C_0 \;\le\; \min_{p_i(j)} C$$

$$\sum_i P_i = 1, \quad P_i \ge 0; \qquad \sum_j p_i(j) = 1, \quad p_i(j) \ge 0$$

where C is the capacity of any channel with
transition probabilities $p_i(j)$ and having the
adjacency matrix $A_{ij}$.
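The left-hand side of Theorem 1 can be explored numerically without carrying out the minimization. Because the minimum of the quadratic form is no larger than its value at any particular distribution, evaluating $-\log \sum_{ij} A_{ij} P_i P_j$ at any admissible trial $P$ already gives a valid, possibly weaker, lower bound on $C_0$. The sketch below does this for two trial distributions; the five-letter cyclic adjacency is the structure assumed here for Fig. 2, and the name `lower_bound` is hypothetical.

```python
import math


def lower_bound(A, P):
    """Evaluate -log(sum_ij A[i][j] P[i] P[j]) for a trial distribution P."""
    n = len(P)
    q = sum(A[i][j] * P[i] * P[j] for i in range(n) for j in range(n))
    return -math.log(q)


# Assumed adjacency matrix: letter i adjacent to i + 1 and i - 1 modulo 5,
# with ones on the diagonal as in the definition of A.
A = [[1 if (i - j) % 5 in (0, 1, 4) else 0 for j in range(5)] for i in range(5)]

uniform = [0.2] * 5
two_letters = [0.5, 0.0, 0.5, 0.0, 0.0]

b_uniform = lower_bound(A, uniform)       # -log(15 * 0.04) = log(5/3)
b_two = lower_bound(A, two_letters)       # -log(0.25 + 0.25) = log 2
```

For this example the distribution concentrated on two non-adjacent letters gives the larger of the two values, matching the rate obtained from single non-adjacent letters.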


The upper bound is fairly obvious. The 
zero error capacity is certainly less than or 
equal to the ordinary capacity for any channel 
with the same adjacency matrix since the former 
requires codes with zero probability of error 
while the latter requires only codes approaching 
zero probability of error. By minimizing the 
capacity through variation of the $p_i(j)$ we find
the lowest upper bound available through this
argument. Since the capacity is a continuous
function of the $p_i(j)$ in the closed region
defined by $p_i(j) \ge 0$, $\sum_j p_i(j) = 1$, we may
write min instead of greatest lower bound.


It is worth noting that it is only necessary
to consider a particular channel in performing
this minimization, although there are an
infinite number with the same adjacency matrix.
This one particular channel is obtained as
follows from the adjacency matrix. If $A_{ik} = 1$
for a pair ik, define an output letter j with
$p_i(j)$ and $p_k(j)$ both differing from zero. Now
if there are any three input letters, say ikl,
all adjacent to each other, define an output
letter, say m, with $p_i(m)$, $p_k(m)$, $p_l(m)$ all different
from zero. In the adjacency graph this
corresponds to a complete sub-graph with three
vertices. Next, subsets of four letters or
complete subgraphs of four vertices, say iklm,
are given an output letter, each being connected
to it, and so on. It is evident that any channel
with the same adjacency matrix differs from that
just described only by variation in the number of
output symbols for some of the pairs, triplets,
etc., of adjacent input letters. If a channel
has more than one output symbol for an adjacent
subset of input letters, then its capacity is
reduced by identifying these. If a channel
contains no element, say for a triplet ikl of
adjacent input letters, this will occur as a
special case of our canonical channel which has
output letter m for this triplet when $p_i(m)$,
$p_k(m)$ and $p_l(m)$ all vanish.
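The subsets that receive an output letter in this construction are the pairs, triplets, and so on, of mutually adjacent input letters, that is, the complete subgraphs of the adjacency graph. A brute-force enumeration, suitable only for small alphabets, might look as follows. The function name `complete_subsets` and the two example matrices are hypothetical; the five-letter cyclic adjacency is the structure assumed here for Fig. 2.

```python
from itertools import combinations


def complete_subsets(A):
    """All sets of two or more mutually adjacent input letters.

    A is an adjacency matrix with ones on the diagonal. Each returned tuple
    corresponds to one output letter of the channel described in the text.
    """
    n = len(A)
    found = []
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            if all(A[i][k] == 1 for i, k in combinations(subset, 2)):
                found.append(subset)
    return found


# Assumed five-letter cyclic adjacency, and a channel with three letters
# that are all adjacent to each other.
pentagon = [[1 if (i - j) % 5 in (0, 1, 4) else 0 for j in range(5)]
            for i in range(5)]
triangle = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

pentagon_subsets = complete_subsets(pentagon)
triangle_subsets = complete_subsets(triangle)
```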


The lower bound of the theorem will now be
proved. We use the procedure of random codes
based on probabilities for the letters $P_i$, these
being chosen to minimize the quadratic form
$\sum_{ij} A_{ij} P_i P_j$. Construct an ensemble of codes
each containing M words, each word n letters long.
The words in a code are chosen by the following
stochastic method. Each letter of each word is
chosen independently of all others and is the
letter i with probability $P_i$. We now compute the
probability in the ensemble that any particular
word is not adjacent to any other word in its
code. The probability that the first letter of
one word is adjacent to the first letter of a
second word is $\sum_{ij} A_{ij} P_i P_j$, since this sums
the cases of adjacency with coefficient 1 and
those of non-adjacency with coefficient 0. The
probability that two words are adjacent in all
letters, and therefore adjacent as words, is
$\left( \sum_{ij} A_{ij} P_i P_j \right)^n$. The probability of non-adjacency
is therefore $1 - \left( \sum_{ij} A_{ij} P_i P_j \right)^n$. The
probability that all M - 1 other words in a code
are not adjacent to a given word is, since they
are chosen independently,

$$\left[ 1 - \left( \sum_{ij} A_{ij} P_i P_j \right)^n \right]^{M-1}$$

which is, by a well known inequality, greater
than $1 - (M - 1) \left( \sum_{ij} A_{ij} P_i P_j \right)^n$, which in turn
is greater than $1 - M \left( \sum_{ij} A_{ij} P_i P_j \right)^n$. If we
set

$$M = (1 - \epsilon)^n \left( \sum_{ij} A_{ij} P_i P_j \right)^{-n},$$

we then have,
by taking $\epsilon$ small, a rate as close as desired to
$-\log \sum_{ij} A_{ij} P_i P_j$. Furthermore, once $\epsilon$ is
chosen, by taking n sufficiently large, we can
insure that $M \left( \sum_{ij} A_{ij} P_i P_j \right)^n = (1 - \epsilon)^n$ is as
small as desired, say, less than $\delta$. The
probability in the ensemble of codes of a
particular word being adjacent to any other in
its own code is now less than $\delta$. This implies
that there are codes in the ensemble for which
the ratio of the number of such undesired words
to the total number in the code is less than or
equal to $\delta$. For, if not, the ensemble average
would be worse than $\delta$. Select such a code and
delete from it the words having this property.
We have reduced our rate only by at most
$\frac{1}{n} \log (1 - \delta)^{-1}$. Since $\epsilon$ and $\delta$ were both
arbitrarily small, we obtain error-free codes
arbitrarily close to the rate
$-\log \min_{P_i} \sum_{ij} A_{ij} P_i P_j$ as stated in the theorem.
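The construction in this proof, drawing a random code and then deleting every word that is adjacent to some other word, can be imitated on a small scale. The following sketch is only an illustration and proves nothing: the five-letter cyclic adjacency is the structure assumed here for Fig. 2, the trial distribution, code size, block length and seed are arbitrary choices, and the function names are hypothetical.

```python
import random


# Assumed adjacency: letters are confusable if equal or differing by 1 mod 5.
def letters_confusable(a, b):
    return (a - b) % 5 in (0, 1, 4)


def words_adjacent(u, v):
    return all(letters_confusable(a, b) for a, b in zip(u, v))


def expurgated_random_code(M, n, P, rng):
    """Draw M random words of length n with letter probabilities P, then
    delete every word that is adjacent to some other word of the code."""
    letters = range(len(P))
    code = [tuple(rng.choices(letters, weights=P, k=n)) for _ in range(M)]
    kept = [
        w
        for idx, w in enumerate(code)
        if not any(words_adjacent(w, v) for jdx, v in enumerate(code) if jdx != idx)
    ]
    return code, kept


rng = random.Random(0)                    # fixed seed, arbitrary choice
P = [0.5, 0.0, 0.5, 0.0, 0.0]             # trial letter probabilities
code, kept = expurgated_random_code(M=8, n=6, P=P, rng=rng)
```

Whatever words are drawn, the surviving words in `kept` are pairwise non-adjacent by construction and so form an error-free code; how many survive depends on M, n and P in the way the proof quantifies.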


In connection with the upper bound of
Theorem 1, the following result is useful in
evaluating the minimum C. It is also interesting
in its own right and will prove useful later
in connection with channels having a feedback
link.


Theorem 2: In a discrete memoryless
channel with transition probabilities $p_i(j)$ and
input letter probabilities $P_i$ the following
three statements are equivalent.


1) The rate of transmission

$$R = \sum_{i,j} P_i\, p_i(j) \log \frac{p_i(j)}{\sum_m P_m\, p_m(j)}$$

is stationary under variation of all non-vanishing
$P_i$ subject to $\sum_i P_i = 1$ and under variation
of $p_i(j)$ for those $p_i(j)$ such that $P_i p_i(j) > 0$
and subject to $\sum_j p_i(j) = 1$.


2) The mutual information between input-output
pairs $I_{ij} = \log \left( p_i(j) / \sum_m P_m p_m(j) \right)$ is
constant, $I_{ij} = I$, for all ij pairs of non-vanishing
probability (i.e. pairs for which $P_i p_i(j) > 0$).


3) We have $p_i(j) = r_j$, a function of j
only whenever $P_i p_i(j) > 0$; and also $\sum_{i \in S_j} P_i = h$,
a constant independent of j where $S_j$ is the set
of input letters that can produce output letter j
with probability greater than zero. We also have
$I = \log h^{-1}$.


The $p_i(j)$ and $P_i$ corresponding to the
maximum and minimum capacity when the $p_i(j)$ are
varied (keeping, however, any $p_i(j)$ that are zero
fixed at zero) satisfy 1), 2) and 3).
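Conditions 2) and 3) can be checked numerically on a simple example. The channel below is hypothetical and chosen for illustration: each of five inputs i produces output i or output i + 1 (modulo 5) with probability 1/2, and the inputs are used with equal probability. Then $p_i(j) = 1/2$ depends on j only (trivially), each output can be produced by exactly two inputs so that $h = 2/5$, and the mutual information of every possible pair should equal $\log h^{-1}$.

```python
import math

# Hypothetical channel: input i goes to output i or (i + 1) mod 5,
# each with probability 1/2. Inputs are equiprobable.
p = [[0.5 if j in (i, (i + 1) % 5) else 0.0 for j in range(5)] for i in range(5)]
P = [0.2] * 5

# Q[j]: probability of output letter j.
Q = [sum(P[m] * p[m][j] for m in range(5)) for j in range(5)]

# Mutual information I_ij for every pair of non-vanishing probability.
I = {
    (i, j): math.log(p[i][j] / Q[j])
    for i in range(5)
    for j in range(5)
    if P[i] * p[i][j] > 0
}

# Rate of transmission R, and h[j] = sum of P_i over inputs that can give j.
R = sum(P[i] * p[i][j] * I[i, j] for (i, j) in I)
h = [sum(P[i] for i in range(5) if p[i][j] > 0) for j in range(5)]
```

Here every entry of `I`, the rate `R`, and `math.log(1 / h[j])` for each j all agree, as statements 2) and 3) require.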


Proof: We will show first that 1) and 2) 
are equivalent and then that 2) and 3) are 
equivalent. 


R is a bounded continuous function of its
arguments $P_i$ and $p_i(j)$ in the (bounded) region
of allowed values defined by $\sum_i P_i = 1$, $P_i \ge 0$,
$\sum_j p_i(j) = 1$, $p_i(j) \ge 0$. R has a finite
partial derivative with respect to any $p_i(j) > 0$.
In fact, we readily calculate

$$\frac{\partial R}{\partial p_i(j)} = P_i \log \frac{p_i(j)}{\sum_m P_m\, p_m(j)}$$

A necessary and sufficient condition that R be
stationary for small variation of the non-vanishing
$p_i(j)$ subject to the conditions given
is that

$$\frac{\partial R}{\partial p_i(j)} = \frac{\partial R}{\partial p_i(k)}$$

for all i, j, k such that $P_i$, $p_i(j)$, $p_i(k)$ do not
vanish. This requires that

$$P_i \log \frac{p_i(j)}{\sum_m P_m\, p_m(j)} = P_i \log \frac{p_i(k)}{\sum_m P_m\, p_m(k)}$$

If we let $Q_j = \sum_m P_m\, p_m(j)$, the probability of
output letter j, then this is equivalent to

$$\frac{p_i(j)}{Q_j} = \frac{p_i(k)}{Q_k}$$

In other words, $p_i(j)/Q_j$ is independent of j, a
function of i only whenever $P_i > 0$ and $p_i(j) > 0$.
This function of i we call $\alpha_i$. Thus

$$p_i(j) = \alpha_i Q_j$$

unless $p_i(j) = 0$.


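Now, taking the partial derivative of R
with respect to $P_i$ we obtain:

$$\frac{\partial R}{\partial P_i} = \sum_j p_i(j) \log \frac{p_i(j)}{Q_j} - 1$$

For R to be stationary subject to $\sum_i P_i = 1$
we must have $\partial R / \partial P_i = \partial R / \partial P_k$. Thus

$$\sum_j p_i(j) \log \frac{p_i(j)}{Q_j} = \sum_j p_k(j) \log \frac{p_k(j)}{Q_j}$$

Since for $P_i p_i(j) > 0$ we have $p_i(j)/Q_j = \alpha_i$, this
becomes

$$\sum_j p_i(j) \log \alpha_i = \sum_j p_k(j) \log \alpha_k$$

$$\log \alpha_i = \log \alpha_k.$$

Thus $\alpha_i$ is independent of i and may be written $\alpha$.
Consequently

$$\frac{p_i(j)}{Q_j} = \alpha$$

$$\log \frac{p_i(j)}{Q_j} = \log \alpha = I$$

whenever $P_i p_i(j) > 0$.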


The converse result is an easy reversal of
the above argument. If $\log \frac{p_i(j)}{Q_j} = I$, then
$\partial R / \partial P_i = I - 1$, by a simple substitution in the
$\partial R / \partial P_i$ formula. Hence $R$ is stationary under
variation of $P_i$ constrained by $\sum_i P_i = 1$.
Further, $\partial R / \partial p_i(j) = P_i I = \partial R / \partial p_i(k)$, and hence
the variation of $R$ also vanishes subject to
$\sum_j p_i(j) = 1$.

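We now prove that 2) implies 3). Suppose
$\log \frac{p_i(j)}{Q_j} = I$ whenever $P_i p_i(j) > 0$. Then
$p_i(j) = e^I Q_j$, a function of $j$ only under this same
condition. Also, if $q_j(i)$ is the conditional
probability of $i$ given $j$, then

$$\frac{P_i p_i(j)}{P_i Q_j} = \frac{q_j(i)}{P_i} = e^I$$

$$1 = \sum_{i \in S_j} q_j(i) = e^I \sum_{i \in S_j} P_i$$

so that $\sum_{i \in S_j} P_i = e^{-I} = h$, a constant independent of $j$.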


To prove that 3) implies 2) we assume
$p_i(j) = r_j$ when $P_i p_i(j) > 0$. Then

$$\frac{P_i p_i(j)}{P_i Q_j} = \frac{q_j(i)}{P_i} = \frac{r_j}{Q_j} = \lambda_j \text{ (say)}$$

Now, summing the equation $P_i \lambda_j = q_j(i)$ over $i \in S_j$
and using the assumption from 3) that $\sum_{i \in S_j} P_i = h$
we obtain

$$h \lambda_j = 1$$

so $\lambda_j$ is $h^{-1}$ and independent of $j$. Hence
$I_{ij} = I = \log h^{-1}$.


The last statement of the theorem concerning
minimum and maximum capacity under variation
of $p_i(j)$ follows from the fact that $R$ at these
points must be stationary under variation of all
non-vanishing $P_i$ and $p_i(j)$, and hence the
corresponding $P_i$ and $p_i(j)$ satisfy condition 1)
of the theorem.
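
The equivalence of statements 2) and 3) can be checked numerically on a small case. The sketch below is only an illustration: the five-letter channel and the uniform input probabilities are hypothetical choices, not taken from the paper, and the helper names `pair_information` and `support_mass` are invented here. It computes $I_{ij}$ for every pair of non-vanishing probability and the total input probability $\sum_{i \in S_j} P_i$ for every output letter.

```python
import math

def pair_information(P, p):
    """Return {(i, j): I_ij} for every pair with P[i] * p[i][j] > 0."""
    n_in, n_out = len(P), len(p[0])
    # Q[j] is the probability of output letter j.
    Q = [sum(P[i] * p[i][j] for i in range(n_in)) for j in range(n_out)]
    return {
        (i, j): math.log(p[i][j] / Q[j])
        for i in range(n_in)
        for j in range(n_out)
        if P[i] * p[i][j] > 0
    }

def support_mass(P, p):
    """For each output j, total probability of the inputs in S_j."""
    n_in, n_out = len(P), len(p[0])
    return [sum(P[i] for i in range(n_in) if p[i][j] > 0) for j in range(n_out)]

# Hypothetical channel (an assumption for illustration): input i reaches
# outputs i and i+1 (mod 5) with probability 1/2 each; inputs equally likely.
p = [[0.5 if j in (i, (i + 1) % 5) else 0.0 for j in range(5)] for i in range(5)]
P = [0.2] * 5

info = pair_information(P, p)   # every value should equal the same constant I
h = support_mass(P, p)          # every entry should equal the same constant h
```

In this example every $p_i(j)$ on the support equals 1/2, so statement 3) holds with $h = 2/5$, and each $I_{ij}$ comes out as $\log h^{-1} = \log 2.5$ (natural logarithm), as statement 3) predicts.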


For simple channels it is usually more
convenient to apply particular tricks in trying
to evaluate $C_0$ instead of the bounds given in
Theorem 1, which involve maximizing and minimizing
processes. The simplest lower bound, as mentioned
before, is obtained by merely finding the
logarithm of the maximum number of non-adjacent
input letters.


A very useful device for determining $C_0$
which works in many cases may be described using
the notion of an adjacency-reducing mapping.
By this we mean a mapping of letters into other
letters, $i \to \alpha(i)$, with the property that if $i$ and
$j$ are not adjacent in the channel (or graph) then
$\alpha(i)$ and $\alpha(j)$ are not adjacent. If we have a
zero-error code, then we may apply such a mapping
letter by letter to the code and obtain a new
code which will also be of the zero-error type,
since no adjacencies can be produced by the
mapping.


Theorem 3: If all the input letters $i$ can
be mapped by an adjacency-reducing mapping
$i \to \alpha(i)$ into a subset of the letters no two of
which are adjacent, then the zero-error capacity
$C_0$ of the channel is equal to the logarithm of
the number of letters in this subset.


For, in the first place, by forming all
sequences of these letters we obtain a zero-error
code at this rate. Secondly, any zero-error code
for the channel can be mapped into a code using
only these letters and containing, therefore, at
most $e^{nC_0}$ non-adjacent words.
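
The conditions of Theorem 3 are mechanical to check for a given graph and mapping. The following is a minimal sketch under stated assumptions: the three-letter path graph and the mapping `alpha` are hypothetical examples, the function names are invented here, and a letter is treated as adjacent to itself so that two non-adjacent letters can never be mapped to the same letter.

```python
def is_adjacent(adj, i, j):
    """Letters are adjacent if they coincide or are joined in the graph."""
    return i == j or j in adj[i]

def is_adjacency_reducing(adj, alpha):
    """True if every non-adjacent pair (i, j) maps to a non-adjacent pair."""
    letters = list(adj)
    return all(
        not is_adjacent(adj, alpha[i], alpha[j])
        for i in letters
        for j in letters
        if not is_adjacent(adj, i, j)
    )

def image_is_nonadjacent(adj, alpha):
    """True if no two distinct letters in the image of alpha are adjacent."""
    image = sorted(set(alpha.values()))
    return all(
        not is_adjacent(adj, a, b) for a in image for b in image if a != b
    )

# Hypothetical adjacency graph: a path 0 - 1 - 2 (0 and 2 are non-adjacent).
adj = {0: {1}, 1: {0, 2}, 2: {1}}
alpha = {0: 0, 1: 0, 2: 2}      # map the middle letter onto letter 0

if is_adjacency_reducing(adj, alpha) and image_is_nonadjacent(adj, alpha):
    N0 = len(set(alpha.values()))   # equivalent number of letters
else:
    N0 = None                       # the mapping does not settle C_0
```

When both checks pass, Theorem 3 gives $C_0 = \log N_0$ with $N_0$ the size of the image; for the path graph above this is $N_0 = 2$.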


The zero-error capacities, or, more exactly,
the equivalent numbers of input letters for all
adjacency graphs up to five vertices are shown in
Fig. 4. These can all be found readily by the
method of Theorem 3, except for the channel of
Fig. 2 mentioned previously, for which we know
only that the zero-error capacity lies in the
range $\frac{1}{2} \log 5 \le C_0 \le \log \frac{5}{2}$.
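
The lower end of this range corresponds to a zero-error code with five words of length two. The sketch below assumes that the channel of Fig. 2 has a five-cycle (pentagon) as its adjacency graph, which is not restated in this part of the text, and uses the code words $(i, 2i \bmod 5)$ as one possible choice; the function names are illustrative only.

```python
from itertools import combinations

def letters_adjacent(a, b, n=5):
    """On an n-cycle, two letters are adjacent if equal or neighbouring."""
    return (a - b) % n in (0, 1, n - 1)

def words_adjacent(u, v):
    """Two words are adjacent if they are adjacent in every position."""
    return all(letters_adjacent(a, b) for a, b in zip(u, v))

# Assumed code for the pentagon channel: five words of length two.
code = [(i, (2 * i) % 5) for i in range(5)]

# The code is zero-error if no two distinct words are adjacent.
zero_error = not any(words_adjacent(u, v) for u, v in combinations(code, 2))
```

Five mutually non-adjacent words of length two give a rate of $\frac{1}{2} \log 5$, which exceeds the $\log 2$ obtained from single letters.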


All graphs with six vertices have been 
examined and the capacities of all of these can 
also be found by this theorem, with the exception 
of four. These four can be given in terms of the 
capacity of Fig. 2, so that this case is essen- 
tially the only unsolved problem up to seven 
vertices. Graphs with seven vertices have not 
been completely examined but at least one new 
situation arises, the analog of Fig. 2 with seven 
input letters. 


As examples of how the $N_0$ values were
computed by the method of adjacency-reducing
mappings, several of the graphs in Fig. 4 have
been labelled to show a suitable mapping. The
scheme is as follows. All nodes labelled $a$ are
mapped into node $\alpha$ as well as $\alpha$ itself. All
nodes labelled $b$ and also $\beta$ are mapped into node $\beta$.
All nodes labelled $c$ and $\gamma$ are mapped into node
$\gamma$. It is readily verified that no new adjacencies
are produced by the mappings indicated and
that the $\alpha$, $\beta$, $\gamma$ nodes are non-adjacent.


$C_0$ for Sum and Product Channels


Theorem 4: If two memoryless channels have
zero-error capacities $C_0' = \log A$ and $C_0'' = \log B$,
their sum has a zero-error capacity greater than
or equal to $\log (A + B)$ and their product a
zero-error capacity greater than or equal to $C_0' + C_0''$.
If the graph of either of the two channels can
be reduced to non-adjacent points by the mapping
method (Theorem 3), then these inequalities can
be replaced by equalities.


Proof: It is clear that in the case of
the product, the zero-error capacity is at least
$C_0' + C_0''$, since we may form a product code from two
codes with rates close to $C_0'$ and $C_0''$. If these
codes are not of the same length, we use for the
new code the least common multiple of the
individual lengths and form all sequences of the code
words of each of the codes up to this length. To
prove equality in case one of the graphs, say that
for the first channel, can be mapped into $A$
non-adjacent points, suppose we have a code of length $n$
for the product channel. The letters for the product
code, of course, are ordered pairs of letters
corresponding to the original channels. Replace
the first letter in each pair in all code words
by the letter corresponding to reduction by the
mapping method. This reduces or preserves
adjacency between words in the code. Now sort
the code words into $A^n$ subsets according to the
sequences of first letters in the ordered pairs.
Each of these subsets can contain at most $B^n$
members, since this is the largest possible number
of code words for the second channel of this length.
Thus, in total, there are at most $A^n B^n$ words in
the code, giving the desired result.


In the case of the sum of the two channels,
we first show how, from two given codes for the
two channels, to construct a code for the sum
channel with equivalent number of letters equal
to $A^{1-\delta} + B^{1-\delta}$, where $\delta$ is arbitrarily small
and $A$ and $B$ are the equivalent number of letters
for the two codes. Let the two codes have
lengths $n_1$ and $n_2$. The new code will have length
$n$ where $n$ is the smallest integer greater than
both $n_1/\delta$ and $n_2/\delta$. Now form codes for the first
channel and for the second channel for all
lengths $k$ from zero to $n$ as follows. Let $k$ equal
$a n_1 + b$, where $a$ and $b$ are integers and $b < n_1$.
We form all sequences of $a$ words from the given
code for the first channel and fill in the
remaining $b$ letters arbitrarily, say all with the
first letter in the code alphabet. We achieve
at least $A^{k - \delta n}$ different words of length $k$ none
of which is adjacent to any other. In the same
way we form codes for the second channel and
achieve $B^{k - \delta n}$ words in this code of length $k$.
We now intermingle the $k$ code for the first
channel with the $n - k$ code for the second channel
in all $\binom{n}{k}$ possible ways and do this for each
value of $k$. This produces a code $n$ letters long
with at least

$$\sum_{k=0}^{n} \binom{n}{k} A^{k - n\delta} B^{n - k - n\delta} = (AB)^{-n\delta} (A + B)^n$$

different words. It is readily
seen that no two of these different words are
adjacent. The rate is at least $\log (A + B) - \delta \log AB$,
and since $\delta$ was arbitrarily small, we can
achieve a rate arbitrarily close to $\log (A + B)$.
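
The word count of this construction and the resulting rate can be evaluated directly. The sketch below simply sums the binomial terms; the function names and the sample values of $A$, $B$, $n$ and $\delta$ are assumptions chosen for illustration, and rates are in natural-logarithm units.

```python
import math

def sum_code_size(A, B, n, delta):
    """Lower bound on the number of words in the intermingled code."""
    return sum(
        math.comb(n, k) * A ** (k - n * delta) * B ** (n - k - n * delta)
        for k in range(n + 1)
    )

def sum_code_rate(A, B, n, delta):
    """Rate (per letter) corresponding to sum_code_size."""
    return math.log(sum_code_size(A, B, n, delta)) / n

# Illustrative values only: A = 2, B = 3, block length 10, delta = 0.1.
size = sum_code_size(2, 3, 10, 0.1)
rate = sum_code_rate(2, 3, 10, 0.1)
```

By the binomial theorem the sum equals $(AB)^{-n\delta}(A+B)^n$, so the computed rate agrees with $\log (A + B) - \delta \log AB$ and approaches $\log (A + B)$ as $\delta$ is made small.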


To show that it is not possible, when one 
of the graphs reduces by mapping to non-adjacent 
points, to exceed the rate corresponding to the 
number of letters A + B, consider any given code 
of length n for the sum channel. The words in 
this consist of sequences of letters each letter 
corresponding to one or the other of the two 


Fig. 4 - All graphs with 1, 2, 3, 4, 5 nodes and the corresponding $N_0$ for
channels with these as adjacency graphs (note $C_0 = \log N_0$).


channels. The words may be subdivided into
classes corresponding to the pattern of the
choices of letters between the two channels.
There are $2^n$ such classes with $\binom{n}{k}$ classes in
which exactly $k$ of the letters are from the first
channel and $n - k$ from the second. Consider now
a particular class of words of this type.
Replace the letters from the first channel alphabet
by the corresponding non-adjacent letters. This
does not harm the adjacency relations between
words in the code. Now, as in the product case,
partition the code words according to the
sequence of letters involved from the first
channel. This produces at most $A^k$ subsets. Each
of these subsets contains at most $B^{n-k}$ members,
since this is the greatest possible number of
non-adjacent words for the second channel of length
$n - k$. In total, then, summing over all values
of $k$ and taking account of the $\binom{n}{k}$ classes for
each $k$, there are at most

$$\sum_{k} \binom{n}{k} A^k B^{n-k} = (A + B)^n$$

words in the code for the sum channel.
This proves the desired result.


Theorem 4, of course, is analogous to 
known results for ordinary capacity C, where the 
product channel has the sum of the ordinary 
capacities and the sum channel has an equivalent 
number of letters equal to the sum of the equiva- 
lent numbers of letters for the individual 
channels. We conjecture but have not been able 
to prove that the equalities in Theorem 4 hold 
in general, not just under the conditions given. 
We now prove a lower bound for the probability of 
error when transmitting at a rate greater than $C_0$.


Theorem 5: In any code of length $n$ and
rate $R > C_0$, $C_0 > 0$, the probability of error $P_e$
will satisfy $P_e \ge (1 - e^{n(C_0 - R)})\, p_{\min}^n$, where
$p_{\min}$ is the minimum non-vanishing $p_i(j)$.

Proof: By definition of $C_0$ there are not
more than $e^{nC_0}$ non-adjacent words of length $n$.
With $R > C_0$, among $e^{nR}$ words there must,
therefore, be an adjacent pair. The adjacent pair has a
common output word which either can cause with a
probability at least $p_{\min}^n$. This output word
cannot be decoded into both inputs. At least one,
therefore, must cause an error when it leads to
this output word. This gives a contribution at
least $e^{-nR} p_{\min}^n$ to the probability of error $P_e$.
Now omit this word from consideration and apply
the same argument to the remaining $e^{nR} - 1$ words
of the code. This will give another adjacent pair
and another contribution of error of at least
$e^{-nR} p_{\min}^n$. The process may be continued until the
number of code points remaining is just $e^{nC_0}$. At
this time, the computed probability of error must
be at least

$$(e^{nR} - e^{nC_0})\, e^{-nR}\, p_{\min}^n = (1 - e^{n(C_0 - R)})\, p_{\min}^n$$
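
The bound of Theorem 5 is easy to evaluate for given parameters. This is a small sketch only; the function name and the sample numbers are assumptions, and the rates $R$ and $C_0$ are taken in natural-logarithm units to match the exponentials in the theorem.

```python
import math

def error_lower_bound(n, R, C0, p_min):
    """Evaluate (1 - e^{n(C0 - R)}) * p_min**n for R > C0 > 0."""
    if not (R > C0 > 0):
        raise ValueError("the bound is stated only for R > C0 > 0")
    return (1.0 - math.exp(n * (C0 - R))) * p_min ** n

# Illustrative values: one-letter code with three words on a channel with
# two non-adjacent letters (C0 = log 2) and smallest transition value 1/2.
bound = error_lower_bound(n=1, R=math.log(3), C0=math.log(2), p_min=0.5)
```

For these values the bound is $(1 - 2/3) \cdot \frac{1}{2} = 1/6$. Because of the factor $p_{\min}^n$, the bound decreases exponentially with the block length.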




Channels with a Feedback Link 


We now consider the corresponding problem 
for channels with complete feedback. By this we 
mean that there exists a return channel sending 
back from the receiving point to the transmitting 
point, without error, the letters actually 
received. It is assumed that this information is 
received at the transmitting point before the next 
letter is transmitted, and can be used, therefore, 
if desired, in choosing the next transmitted 
letter.


It is interesting that for a memoryless
channel the ordinary forward capacity is the same
with or without feedback. This will be shown in
Theorem 6. On the other hand, the zero-error
capacity may, in some cases, be greater with
feedback than without. In the channel shown in
Fig. 5, for example, $C_0 = \log 2$. However, we
will see as a result of Theorem 7 that with
feedback the zero-error capacity $C_{0F} = \log 2.5$.


Fig. 5 


We first define a block code of length n 
for a feedback system. This means that at the 
transmitting point there is a device with two 
inputs, or, mathematically, a function with two 
arguments. One argument is the message to be 
transmitted, the other, the past received letters 
(which have come in over the feedback link). The 
value of the function is the next letter to be 
transmitted. Thus, the function may be thought
of as $x_{j+1} = f(k, v_j)$, where $x_{j+1}$ is the $(j+1)$st
transmitted letter in a block, $k$ is an index
ranging from 1 to $M$ and represents the
specific message, and $v_j$ is a received word of
length $j$. Thus $j$ ranges from 0 to $n - 1$ and $v_j$ ranges
over all received words of these lengths.


In operation, if message $m_k$ is to be sent
$f$ is evaluated for $f(k, -)$ where the $-$ means "no
word" and this is sent as the first transmitted
letter. If the feedback link sends back $\alpha$, say,
as the first received letter, the next
transmitted letter will be $f(k, \alpha)$. If this is
received as $\beta$, the next transmitted letter will
be $f(k, \alpha\beta)$, etc.
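
The operation of a feedback block code can be sketched as a loop in which the encoding function is called with the message index and the received word so far. Everything concrete below is an assumption made for illustration: the names `send_block`, `arq_encoder` and `make_erasure_channel`, the use of an erasure symbol `"?"`, and the scripted channel are not from the paper; the message index is taken as a non-negative integer whose binary digits are sent.

```python
def send_block(f, k, n, channel):
    """Send message index k over n channel uses with complete feedback.

    f(k, received) returns the next transmitted letter, and channel(x)
    returns the letter actually received when x is transmitted.
    """
    received = ()                     # the empty word plays the role of "-"
    for _ in range(n):
        x = f(k, received)            # depends on message and past feedback
        y = channel(x)
        received = received + (y,)    # returned to the transmitter unchanged
    return received

def arq_encoder(k, received):
    """Hypothetical encoder: repeat the current bit of k until not erased."""
    delivered = sum(1 for y in received if y != "?")
    return (k >> delivered) & 1

def make_erasure_channel(erased_uses):
    """Scripted channel that erases the letters sent on the given uses."""
    state = {"use": 0}

    def channel(x):
        use = state["use"]
        state["use"] += 1
        return "?" if use in erased_uses else x

    return channel

# Message 5 (binary 101, least significant bit first), second use erased.
word = send_block(arq_encoder, 5, 4, make_erasure_channel({1}))
```

Here the encoder uses the feedback to decide which digit to send next, so the erased digit is repeated and the received word is `(1, "?", 0, 1)`.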


Theorem 6: In a memoryless discrete
channel with feedback, the forward capacity is
equal to the ordinary capacity $C$ (without
feedback). The average change in mutual information
$I_{vm}$ between received sequence $v$ and message $m$
for a letter of text is not greater than $C$.


Proof: Let $v$ be the received sequence to
date of a block, $m$ the message, $x$ the next
transmitted letter and $y$ the next received letter.
These are all random variables and, also, $x$ is a
function of $m$ and $v$. This function, namely, is
the one which defines the encoding procedure with
feedback whereby the next transmitted letter $x$ is
determined by the message $m$ and the feedback
information $v$ from the previous received signals.
The channel being memoryless implies that the
next operation is independent of the past, in
particular, $\Pr[y/x] = \Pr[y/x,v]$.


The average change in mutual information,
when a particular $v$ has been received, due to the
$x, y$ pair is given by (we are averaging over
messages $m$ and next received letters $y$, for a
given $v$):

$$\Delta I = I_{m,vy} - I_{m,v} = \sum_{y,m} \Pr[y,m/v] \log \frac{\Pr[v,y,m]}{\Pr[v,y]\Pr[m]} - \sum_{m} \Pr[m/v] \log \frac{\Pr[v,m]}{\Pr[v]\Pr[m]}$$

Since $\Pr[m/v] = \sum_y \Pr[y,m/v]$, the second sum may
be rewritten as $\sum_{y,m} \Pr[y,m/v] \log \frac{\Pr[v,m]}{\Pr[v]\Pr[m]}$.
The two sums then combine to give

$$\Delta I = \sum_{y,m} \Pr[y,m/v] \log \frac{\Pr[v,y,m]\Pr[v]}{\Pr[v,m]\Pr[v,y]} = \sum_{y,m} \Pr[y,m/v] \log \frac{\Pr[y/v,m]\Pr[v]}{\Pr[v,y]}$$




The sum on $m$ may be thought of as summed first
on the $m$'s which result in the same $x$ (for the
given $v$), recalling that $x$ is a function of $m$
and $v$, and then summing on the different $x$'s. In
the first summation, the term $\Pr[y/v,m]$ is
constant at $\Pr[y/x]$ and the coefficient of the
logarithm sums to $\Pr[x,y/v]$. Thus we can write

$$\Delta I = \sum_{x,y} \Pr[x,y/v] \log \frac{\Pr[y/x]}{\Pr[y/v]}$$


Now consider the rate for the channel (in the
ordinary sense without feedback) if we should
assign to the $x$'s the probabilities $q(x) = \Pr[x/v]$.
The probabilities for pairs, $r(x,y)$,
and for the $y$'s alone, $w(y)$, in this situation
would then be

$$r(x,y) = q(x) \Pr[y/x] = \Pr[x/v] \Pr[y/x] = \Pr[x,y/v]$$

$$w(y) = \sum_x r(x,y) = \sum_x \Pr[x,y/v] = \Pr[y/v]$$

Hence the rate would be

$$R = \sum_{x,y} r(x,y) \log \frac{\Pr[y/x]}{w(y)} = \sum_{x,y} \Pr[x,y/v] \log \frac{\Pr[y/x]}{\Pr[y/v]} = \Delta I$$

Since $R \le C$, the channel capacity ($C$ being the
maximum possible $R$ for all $q(x)$ assignments), we
conclude that

$$\Delta I \le C.$$


Since the average change in I per letter is 
not greater than C, the average change in n 
letters is not greater than nC. Hence, in a block 
code of length $n$ with input rate $R$, if $R > C$ then
the equivocation at the end of a block will be at
least $R - C$, just as in the non-feedback case.
In other words, it is not possible to approach
zero equivocation (or, as easily follows, zero 
probability of error) at a rate exceeding the 
channel capacity. It is, of course, possible to 
do this at rates less than C, since certainly 
anything that can be done without feedback can 
be done with feedback. 


It is interesting that the first sentence 
of Theorem 6 can be generalized readily to chan- 
nels with memory provided they are of such a 
nature that the internal state of the channel 
can be calculated at the transmitting point from 
the initial state and the sequence of letters 
that have been transmitted. If this is not the 
case, the conclusion of the theorem will not 
always be true, that is, there exist channels of 
a more complex sort for which the forward
capacity with feedback exceeds that without feed- 
back. We shall not, however, give the details of 
these generalizations here. 


Returning now to the zero-error problem,
we define a zero-error capacity $C_{0F}$ for a
channel with feedback in the obvious way: the
least upper bound of rates for block codes with
no errors. The next theorem solves the problem
of evaluating $C_{0F}$ for memoryless channels with
feedback, and indicates how rapidly $C_{0F}$ may be
approached as the block length $n$ increases.


Theorem 7: In a memoryless discrete
channel with complete feedback of received letters
to the transmitting point, the zero-error
capacity $C_{0F}$ is zero if all pairs of input
letters are adjacent. Otherwise $C_{0F} = \log P_0^{-1}$,
where

$$P_0 = \min_{P_i} \max_j \sum_{i \in S_j} P_i$$

$P_i$ being a probability assigned to input letter
$i$ ($\sum_i P_i = 1$) and $S_j$ the set of input letters
which can cause output letter $j$ with probability
greater than zero. A zero-error block code of
length $n$ can be found for such a feedback
channel which transmits at a rate
$R \ge C_{0F} \left(1 - \frac{2}{n} \log_2 2t\right)$, where $t$ is the number
of input letters.


The $P_0$ occurring in this theorem has the
following meaning. For any given assignment of
probabilities $P_i$ to the input letters one may
calculate, for each output letter $j$, the total
probability of all input letters that can (with
positive probability) cause $j$. This is
$\sum_{i \in S_j} P_i$. Output letters for which this is
large may be thought of as "bad" in that when
received there is a large uncertainty as to the
cause. To obtain $P_0$ one adjusts the $P_i$ so that the
worst output letter in this sense is as good as
possible.


We first show that if all letters are
adjacent to each other $C_{0F} = 0$. In fact, in
any coding system, any two messages, say $m_1$ and
$m_2$, can lead to the same received sequence with
positive probability. Namely, the first
transmitted letters corresponding to $m_1$ and $m_2$ have a
possible received letter in common. Assuming
this occurs, calculate the next transmitted
letters in the coding system for $m_1$ and $m_2$. These
also have a possible received letter in common.
Continuing in this manner we establish a
received word which could be produced by either
$m_1$ or $m_2$ and therefore they cannot be
distinguished with certainty.


Now consider the case where not all pairs
are adjacent. We will first prove, by induction
on the block length $n$, that the rate $\log P_0^{-1}$
cannot be exceeded with a zero-error code. For
$n = 0$ the result is certainly true. The
inductive hypothesis will be that no block code of
length $n - 1$ transmits at a rate greater than
$\log P_0^{-1}$, or, in other words, can resolve with
certainty more than

$$e^{(n-1) \log P_0^{-1}} = P_0^{-(n-1)}$$

different messages. Now suppose (in
contradiction to the desired result) we have a block code
of length $n$ resolving $M$ messages with $M > P_0^{-n}$.
The first transmitted letter for the code
partitions these $M$ messages among the input letters
for the channel. Let $F_i$ be the fraction of the
messages assigned to letter $i$ (that is, for which
$i$ is the first transmitted letter). Now these
$F_i$ are like probability assignments to the
different letters and therefore by definition of
$P_0$, there is some output letter, say letter $k$,
such that $\sum_{i \in S_k} F_i \ge P_0$. Consider the set of
messages for which the first transmitted letter
belongs to $S_k$. The number of messages in this
set is at least $P_0 M$. Any of these can cause
output letter $k$ as first received letter. When
this happens there are $n - 1$ letters yet to be
transmitted and since $M > P_0^{-n}$ we have $P_0 M > P_0^{-(n-1)}$.
Thus we have a zero-error code of block length
$n - 1$ transmitting at a rate greater than
$\log P_0^{-1}$, contradicting the inductive assumption.
Note that the coding function for this code of
length $n - 1$ is formally defined from the
original coding function by fixing the first
received letter at $k$.


We must now show that the rate $\log P_0^{-1}$ can
actually be approached as closely as desired with
zero-error codes. Let $P_i$ be the set of
probabilities which, when assigned to the input letters,
give $P_0$ for $\min_{P_i} \max_j \sum_{i \in S_j} P_i$. The general
scheme of the code will be to divide the $M$
original messages into $t$ different groups
corresponding to the first transmitted letter.
The number of messages in these groups will be
approximately proportional to $P_1, P_2, \ldots, P_t$.
The first transmitted letter, then, will
correspond to the group containing the message to
be transmitted. Whatever letter is received, the
number of possible messages compatible with this
received letter will be approximately $P_0 M$. This
subset of possible messages is known both at the
receiver and (after the received letter is sent
back to the transmitter) at the transmitting
point.


The code system next subdivides this
subset of messages into $t$ groups, again
approximately in proportion to the probabilities $P_i$. The
second letter transmitted is that corresponding
to the group containing the actual message.
Whatever letter is received, the number of
messages compatible with the two received letters is
now, roughly, $P_0^2 M$.


This process is continued until only a few
messages (less than $t^2$) are compatible with all
the received letters. The ambiguity among these
is then resolved by using a pair of non-adjacent
letters in a simple binary code. The code thus
constructed will be a zero-error code for the
channel.


Our first concern is to estimate carefully
the approximation involved in subdividing the
messages into the $t$ groups. We will show that
for any $M$ and any set of $P_i$, $\sum_i P_i = 1$, it is
possible to subdivide the $M$ messages into groups
of $m_1, m_2, \ldots, m_t$ such that $m_i = 0$ whenever $P_i = 0$
and

$$\left| \frac{m_i}{M} - P_i \right| \le \frac{1}{M}, \qquad i = 1, \ldots, t.$$

We assume without loss of generality that
$P_1, P_2, \ldots, P_s$ are the non-vanishing $P_i$. Choose $m_1$
to be the largest integer such that $\frac{m_1}{M} \le P_1$.
Let $\frac{m_1}{M} - P_1 = \delta_1$. Clearly $|\delta_1| \le \frac{1}{M}$. Next choose
$m_2$ to be the smallest integer such that $\frac{m_2}{M} \ge P_2$
and let $\frac{m_2}{M} - P_2 = \delta_2$. We have $|\delta_2| \le \frac{1}{M}$. Also
$|\delta_1 + \delta_2| \le \frac{1}{M}$, since $\delta_1$ and $\delta_2$ are opposite in
sign and each less than $\frac{1}{M}$ in absolute value.
Next, $m_3$ is chosen so that $\frac{m_3}{M}$ approximates, to
within $\frac{1}{M}$, to $P_3$. If $\delta_1 + \delta_2 \ge 0$, then $\frac{m_3}{M}$ is
chosen less than or equal to $P_3$. If $\delta_1 + \delta_2 < 0$,
then $\frac{m_3}{M}$ is chosen greater than or equal to $P_3$.
Thus again $\left| P_3 - \frac{m_3}{M} \right| = |\delta_3| \le \frac{1}{M}$ and
$|\delta_1 + \delta_2 + \delta_3| \le \frac{1}{M}$. Continuing in this manner
through $P_{s-1}$ we obtain approximations for
$P_1, P_2, \ldots, P_{s-1}$ with the property that
$|\delta_1 + \delta_2 + \cdots + \delta_{s-1}| \le \frac{1}{M}$, or

$$\left| M(P_1 + P_2 + \cdots + P_{s-1}) - (m_1 + m_2 + \cdots + m_{s-1}) \right| \le 1.$$

If we now define
$m_s$ as $M - \sum_{i=1}^{s-1} m_i$ then this inequality can be
written $\left| M(1 - P_s) - (M - m_s) \right| \le 1$. Hence
$\left| \frac{m_s}{M} - P_s \right| \le \frac{1}{M}$. Thus we have achieved the
objective of keeping all approximations $\frac{m_i}{M}$ to
within $\frac{1}{M}$ of $P_i$ and having $\sum_i m_i = M$.
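
The subdivision procedure can be written out as a short routine. The sketch below is one way to realize it, under the sign convention $\delta_i = m_i/M - P_i$: each group size is rounded down when the accumulated error is non-negative and up otherwise, and the last non-vanishing group takes whatever remains. The function name `subdivide` is an assumption, and the probabilities are expected as exact `Fraction` values so that the sum-to-one check is exact.

```python
from fractions import Fraction
import math

def subdivide(M, P):
    """Split M messages into groups m_i with |m_i/M - P_i| <= 1/M.

    P is a list of Fractions summing to 1; groups with P_i = 0 get m_i = 0.
    """
    P = [Fraction(x) for x in P]
    if sum(P) != 1:
        raise ValueError("probabilities must sum to 1")
    nonzero = [i for i, x in enumerate(P) if x > 0]
    m = [0] * len(P)
    running = Fraction(0)            # delta_1 + ... + delta_i so far
    for i in nonzero[:-1]:
        if running >= 0:
            m[i] = math.floor(M * P[i])   # makes delta_i <= 0
        else:
            m[i] = math.ceil(M * P[i])    # makes delta_i >= 0
        running += Fraction(m[i], M) - P[i]
    m[nonzero[-1]] = M - sum(m)      # last group absorbs the remainder
    return m

# Illustrative use: seven messages split in proportions 1/5, 1/5, 1/5, 2/5.
probs = [Fraction(1, 5), Fraction(1, 5), Fraction(1, 5), Fraction(2, 5)]
groups = subdivide(7, probs)
```

Because consecutive corrections have opposite signs, the accumulated error never exceeds $1/M$ in absolute value, which is what keeps the last group within $1/M$ of its target as well.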


Returning now to our main problem note
first that if $P_0 = 1$ then $C_{0F} = 0$ and the
theorem is trivially true. We assume, then, that
$P_0 < 1$. We wish to show that $P_0 \le 1 - \frac{1}{t}$.
Consider the set of input letters which have
the maximum value of $P_i$. This maximum is
certainly greater than or equal to the average
$\frac{1}{t}$. Furthermore, we can arrange to have at least
one of these input letters not connected to some
output letter. For suppose this is not the case.
Then either there are no other input letters
beside this set and we contradict the assumption
that $P_0 < 1$, or there are other input letters
with smaller values of $P_i$. In this case, by
reducing the $P_i$ for one input letter in the
maximum set and increasing correspondingly that
for some input letter which does not connect to
all output letters, we do not increase the value
of $P_0$ (for any $S_j$) and create an input letter of
the desired type. By consideration of an output
letter to which this input letter does not
connect we see that $P_0 \le 1 - \frac{1}{t}$.


Now suppose we start with $M$ messages and
subdivide into groups approximating
proportionality to the $P_i$ as described above. Then when a
letter has been received, the set of possible
messages (compatible with this received letter)
will be reduced to those in the groups
corresponding to letters which connect to the actual
received letter. Each output letter connects to
not more than $t - 1$ input letters (otherwise we
would have $P_0 = 1$). For each of the connecting
groups, the error in approximating $P_i$ has been
less than or equal to $\frac{1}{M}$. Hence the total
relative number in all connecting groups for any
output letter is less than or equal to $P_0 + \frac{t-1}{M}$.
The total number of possible messages after
receiving the first letter consequently drops
from $M$ to a number less than or equal to $P_0 M + t - 1$.


In the coding system to be used, this 
remaining possible subset of messages is sub- 
divided again among the input letters to 
approximate in the same fashion the probabilities
$P_i$. This subdivision can be carried out both at
receiving point and transmitting point using the
same standard procedure (say, exactly the one 
described above) since with the feedback both 
terminals have available the required data, 
namely the first received letter. 


The second transmitted letter obtained by
this procedure will again reduce at the receiving
point the number of possible messages to a value
not greater than $P_0 (P_0 M + t - 1) + t - 1$. This
same process continues with each transmitted
letter. If the upper bound on the number of
possible remaining messages after $k$ letters is
$M_k$, then $M_{k+1} = P_0 M_k + t - 1$. The solution of
this difference equation is

$$M_k = A P_0^k + \frac{t-1}{1 - P_0}$$

This may be readily verified by substitution in
the difference equation. To satisfy the initial
conditions $M_0 = M$ requires $A = M - \frac{t-1}{1 - P_0}$. Thus
the solution becomes

$$M_k = \left( M - \frac{t-1}{1 - P_0} \right) P_0^k + \frac{t-1}{1 - P_0}$$

$$= M P_0^k + \frac{t-1}{1 - P_0} \left( 1 - P_0^k \right)$$

$$\le M P_0^k + t(t - 1)$$

since we have seen above that $1 - P_0 \ge \frac{1}{t}$.
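
The closed form can be checked against the recursion with exact arithmetic. This is a minimal sketch; the function names and the sample values ($P_0 = 2/5$, $t = 4$, $M = 1000$) are assumptions chosen for illustration.

```python
from fractions import Fraction

def remaining_bound_recursive(M, P0, t, k):
    """Iterate M_{j+1} = P0 * M_j + t - 1 starting from M_0 = M."""
    Mk = Fraction(M)
    for _ in range(k):
        Mk = P0 * Mk + (t - 1)
    return Mk

def remaining_bound_closed(M, P0, t, k):
    """Closed form (M - c) * P0**k + c with c = (t - 1) / (1 - P0)."""
    c = Fraction(t - 1) / (1 - P0)
    return (M - c) * P0 ** k + c

P0, t, M = Fraction(2, 5), 4, 1000
values = [remaining_bound_closed(M, P0, t, k) for k in range(8)]
```

With these values the constant term is $(t-1)/(1-P_0) = 5$, comfortably below the cruder bound $t(t-1) = 12$ used in the text.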


If the process described is carried out
for $n_1$ steps, where $n_1$ is the smallest integer
$\ge d$ where $d$ is the solution of $M P_0^d = 1$, then the
number of possible messages left consistent with
the received sequence will be not greater than
$1 + t(t - 1) \le t^2$ (since $t > 1$, otherwise we
should have $C_{0F} = 0$). Now the pair of
non-adjacent letters assumed in the theorem may be
used to resolve the ambiguity among these
$t^2$ or less messages. This will require not more
than $1 + \log_2 t^2 = \log_2 2t^2$ additional letters.
Thus, in total, we have used not more than
$d + 1 + \log_2 2t^2 = d + \log_2 4t^2 = n$ (say) as block
length. We have transmitted in this block
length a choice from $M = P_0^{-d}$ messages. Thus the
zero-error rate we have achieved is

$$R = \frac{1}{n} \log M \ge \frac{d \log P_0^{-1}}{d + \log_2 4t^2}$$

$$= \left( 1 - \frac{1}{n} \log_2 4t^2 \right) \log P_0^{-1}$$

$$= \left( 1 - \frac{1}{n} \log_2 4t^2 \right) C_{0F}$$

Thus we can approximate to $C_{0F}$ as closely as
desired with zero-error codes.


As an example of Theorem 7 consider the
channel in Fig. 5. We wish to evaluate $P_0$. It
is easily seen that we may take $P_1 = P_2 = P_3$ in
forming the min max of Theorem 7, for if they are
unequal the maximum $\sum_{i \in S_j} P_i$ for the
corresponding three output letters would be reduced by
equalizing. Also it is evident, then, that
$P_4 = P_1 + P_2$, since otherwise a shift of
probability one way or the other would reduce the
maximum. We conclude, then, that $P_1 = P_2 = P_3 = 1/5$
and $P_4 = 2/5$. Finally, the zero-error
capacity with feedback is $\log P_0^{-1} = \log 5/2$.
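
The min max of this example can be checked numerically. Since Fig. 5 is not reproduced here, the sketch below assumes a channel structure consistent with the values quoted in the text: four input letters (indexed 0 to 3 in the code) and output sets $S_j$ equal to $\{1,2\}$, $\{2,3\}$, $\{1,3\}$ and $\{4\}$. That structure, the function names, and the random spot check are all assumptions for illustration.

```python
from fractions import Fraction
import random

# ASSUMED output sets S_j (0-based input indices); not taken from Fig. 5.
S = [{0, 1}, {1, 2}, {0, 2}, {3}]

def worst_output_mass(P, S):
    """max over outputs j of the total probability of the inputs in S_j."""
    return max(sum(P[i] for i in Sj) for Sj in S)

# Assignment stated in the text: P1 = P2 = P3 = 1/5 and P4 = 2/5.
P_star = [Fraction(1, 5), Fraction(1, 5), Fraction(1, 5), Fraction(2, 5)]
P0 = worst_output_mass(P_star, S)

def random_assignment(rng, t):
    """Draw a random probability assignment over t input letters."""
    w = [rng.random() for _ in range(t)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
# Spot check: no random assignment should beat the stated minimum.
best_random = min(worst_output_mass(random_assignment(rng, 4), S) for _ in range(1000))
```

For the assumed structure, adding the three pair constraints and the constraint on the fourth letter shows that no assignment can give a maximum below 2/5, so the stated assignment attains $P_0 = 2/5$ and the random search serves only as a sanity check.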


There is a close connection between the
min max process of Theorem 7 and the process of
finding the minimum capacity for the channel
under variation of the non-vanishing transition
probabilities $p_i(j)$ as in Theorem 2. It was
noted there that at the minimum capacity each
output letter can be caused by the same total
probability of input letters. Indeed, it seems
very likely that the probabilities of input
letters to attain the minimum capacity are
exactly those which solve the min max problem of
Theorem 7, and, if this is so, then $C_{\min} = \log P_0^{-1}$.


Acknowledgement 

I am indebted to Peter Elias for first 
pointing out that a feedback link could increase 
the zero-error capacity, as well as for several 


suggestions that were helpful in the proof of 
Theorem 7. 


A LINEAR CIRCUIT VIEWPOINT ON ERROR-CORRECTING CODES*


David A. Huffman 
Department of Electrical Engineering


and 


Research Laboratory of Electronics 
Massachusetts Institute of Technology
Cambridge, Massachusetts 


Abstract


A linear binary filter has as its output a
binary sequence, each digit of which is the 
result of a parity check on a selection of 
preceding output digits and of present and pre- 
ceding digits of the filter input sequence. The 
terminal properties of these filters may be 
described by transfer ratios of polynomials in a 
delay operator. If two binary filters have trans- 
fer ratios which are reciprocally related then 
the filters are mutually inverse in the sense
that, in a cascade connection, the second filter 
unscrambles the scrambling produced by the first. 

The coding of a finite sequence of binary 
information digits for protection against noise 
may be accomplished by a binary sequence filter, 
the output of which becomes the sequence to be 
transmitted. (The inverse filter is utilized at 
the receiver.) Into the filter at the trans- 
mitter is inserted a sequence of information 
digits, immediately followed by another sequence 
of completely predictable digits consisting, say, 
of zeros. The completed block of digits is
scrambled in a linear filter before transmission 
through the noisy channel. If this scrambled 
sequence were unaffected by noise in the channel 
the result of unscrambling by the receiver filter 
would be the original sequence of information 
digits followed by the all-zero sequence. If, 
however, channel noise has been added to the
sequence put into the receiver filter, then its
output is the original sequence, plus the
response of the receiver filter to the noise
superimposed thereon. In particular, the sequence
positions which would have contained all zeros,
had there been no noise, will now contain digits
whose values are related to the sequence positions
affected by the noise. These data may then be 
utilized for the subsequent correction of the 
errors which would otherwise have been caused by 
the noise. 


I Algebraic Description and Realization of Linear
Sequence Filters


A linear binary sequence filter is a
synchronous filter whose inputs and outputs 
are ordered sequences of binary symbols (0's 
and 1's). For the general non-time-varying 


filter each digit of the filter output sequence 
is a modulo-two sum of an arbitrary selection of 
past output digits (Z) and of present and past 
input digits (X). The description of a sequence 
filter in terms of a delay operator, D, is a 
straightforward one. For example, a filter 
whose output Z is the sum of the first and third 
previous output digits and of the present, first, 
second, and fourth previous input digits is 


described by


$$Z = DZ + D^3 Z + X + DX + D^2 X + D^4 X \qquad (1)$$


where the + symbol is used here for the modulo- 
two operation. That is, the present output is 
zero if an even number of selected digits have 
the value one, and is unity if an odd number 
have the value one. 

Since the modulo-two operation is self-inverse
the terms in (1) may be rearranged to
give


$$D^3 Z + DZ + Z = D^4 X + D^2 X + DX + X \qquad \text{(2-a)}$$


or


$$(D^3 + D + I) Z = (D^4 + D^2 + D + I) X \qquad \text{(2-b)}$$


The "transfer ratio" of the filter is then

$$\frac{Z}{X} = \frac{D^4 + D^2 + D + I}{D^3 + D + I} \qquad (3)$$


An efficient realization of this filter
results from rearranging (1) to give


$$X + Z = D(X + Z) + D^2 X + D^3 Z + D^4 X \qquad \text{(4-a)}$$


or


$$X + Z = D\{(X + Z) + D[X + D(Z + DX)]\} \qquad \text{(4-b)}$$


*This work was supported in part by the Signal Corps, the Office of Scientific Research (Air Research
and Development Command), and the Office of Naval Research, of the United States. 


The corresponding filter is given in Fig. 1-a.
The "inverse" filter, whose input is Z and whose
output is X, is described by Eqs. 4-a,b and has a
transfer ratio


X/Z = (D^3 + D + I)/(D^4 + D^2 + D + I) (5)


Its realization is given in Fig. 1-b. Both of


the filters in Fig. 1 utilize only two kinds of 
elements: modulo-two adders and unit delays 
(single-stage shift-registers). The "chain"
realization given for both of these filters consists
of a chain of unit delays with provision made 
for introducing the signals X, Z or (X+Z) 
between each two stages of delay. It uses just 
the number of delay units necessary to remember 
the input or output digit most remote in the past 
which is needed for proper operation of the 
filter (in this case the fourth previous input), 
and an equal number of adders. 
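
As an illustrative sketch only (it is not part of the circuit description above), the filter of Eq. 1 and its inverse can be simulated digit by digit. The names `tap`, `forward_filter` and `inverse_filter` are hypothetical, and both filters are assumed to start from the all-zero "rest" state.

```python
def tap(seq, n, k):
    """Digit of seq delayed by k positions (0 before the sequence starts)."""
    return seq[n - k] if n - k >= 0 else 0


def forward_filter(x):
    """Z = DZ + D^3 Z + X + DX + D^2 X + D^4 X, all sums modulo two."""
    z = []
    for n in range(len(x)):
        bit = (tap(z, n, 1) ^ tap(z, n, 3)
               ^ x[n] ^ tap(x, n, 1) ^ tap(x, n, 2) ^ tap(x, n, 4))
        z.append(bit)
    return z


def inverse_filter(z):
    """Same relation solved for X: X = Z + DZ + D^3 Z + DX + D^2 X + D^4 X."""
    x = []
    for n in range(len(z)):
        bit = (z[n] ^ tap(z, n, 1) ^ tap(z, n, 3)
               ^ tap(x, n, 1) ^ tap(x, n, 2) ^ tap(x, n, 4))
        x.append(bit)
    return x


# Impulse response of the forward filter (first fourteen digits).
impulse_response = forward_filter([1] + [0] * 13)

# Cascading the filter with its inverse returns the original sequence.
message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
recovered = inverse_filter(forward_filter(message))
```

Because modulo-two addition is self-inverse, the inverse filter uses exactly the same taps as the forward filter; only the roles of X and Z are exchanged.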

When a binary filter and its inverse are 
connected in cascade one mode of operation of 
the combination is that for which the transfer 
ratio is the identity operator. In our example 


[(D^4 + D^2 + D + I)/(D^3 + D + I)] [(D^3 + D + I)/(D^4 + D^2 + D + I)] = I (6)


That is, the second filter unscrambles the
scrambling produced by the first. In the error-
correcting scheme proposed in this paper the use
of filters and their inverses will be of paramount
importance.


II Description of a Filter From Its Impulse
Response Characteristics



Later in this paper we will want to make 
use of the fact that there exist finite realiza- 
tions of linear sequence filters whose response 
to an input "impulse" (a single digit 1 preceded 
and followed by infinite sequences of 0's) is
arbitrary except that it must eventually die out
(become the all-0 sequence) or ultimately become
periodic. Suppose, for example, that we wish to
realize a filter whose response to an input
sequence, X*, containing an impulse is the output
sequence, Z*, which ultimately becomes
periodic. (See Fig. 2-a.) Z* can always be
considered to be the sum of two sequences: Z*_p,
the periodic component, and Z*_t, the transient
component. The filter we are trying to design
may, for the moment, be considered to be made up
of two sub-filters, f_p and f_t, which have impulse
responses Z*_p and Z*_t, respectively, and whose
outputs are added to give the desired response Z*
(see Fig. 2-b).

The filter f_p could be realized by a cascade
of two other filters (see Fig. 2-c). The first
would have an impulse response which consisted
of a sequence of impulses spaced seven intervals
apart (the period of the periodic response) and
continuing indefinitely. This filter would have
a transfer ratio


I + D^7 + D^14 + D^21 + ... = I/(D^7 + I) (7)


The periodically recurring output of this filter
could be used as the input to another filter
having the proper transient response (finite in
length). The latter filter has a transfer ratio
which is a polynomial, I + D + D^2 + D^4 in this
example, whose terms correspond to the positions
of the 1's in a typical cycle, contained between
commas, of the desired Z*_p.

The transient part, Z*_t, of the impulse
response is easy to arrange for in our example.
The proper associated filter, f_t, has a transfer
ratio D.


The filter we are designing could then be
realized with a total transfer ratio of


Z*/X* = (I + D + D^2 + D^4)/(D^7 + I) + D (8-a)


which may be rewritten as


Z*/X* = [(I + D + D^2 + D^4) + D(D^7 + I)]/(D^7 + I) (8-b)


or as


Z*/X* = (D^8 + D^4 + D^2 + I)/(D^7 + I) (8-c)


The numerator and denominator of the expression in
Eq. 8-c each contain the factor D^4 + D^2 + D + I
(found by using the Euclidean Algorithm; see reference
1) which may be cancelled to give


Z*/X* = (D^4 + D^2 + D + I)(D^4 + D^2 + D + I)/[(D^4 + D^2 + D + I)(D^3 + D + I)]

      = (D^4 + D^2 + D + I)/(D^3 + D + I) (9)



The transfer ratio is now in its simplest form and 
the filter may be synthesized as has already been 
done in Eqs. 4 and Fig. 1. 
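
The cancellation of the common factor can be checked mechanically. The following is a minimal sketch, assuming polynomials in D with modulo-two coefficients are stored as Python integers (bit k holds the coefficient of D^k); the helper names `gf2_divmod` and `gf2_gcd` are hypothetical.

```python
def gf2_divmod(a, b):
    """Quotient and remainder of a / b for modulo-two polynomials held as ints."""
    q = 0
    while a and a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        q ^= 1 << shift          # record the quotient term D^shift
        a ^= b << shift          # subtract (same as add, modulo two) b * D^shift
    return q, a


def gf2_gcd(a, b):
    """Euclidean Algorithm for modulo-two polynomials."""
    while b:
        a, b = b, gf2_divmod(a, b)[1]
    return a


num = 0b100010101   # D^8 + D^4 + D^2 + I
den = 0b10000001    # D^7 + I

g = gf2_gcd(num, den)                  # common factor
reduced_num = gf2_divmod(num, g)[0]    # numerator after cancellation
reduced_den = gf2_divmod(den, g)[0]    # denominator after cancellation

print(bin(g), bin(reduced_num), bin(reduced_den))
```

Here `0b10111` stands for D^4 + D^2 + D + I and `0b1011` for D^3 + D + I, so the reduced ratio agrees with the transfer ratio of Eq. 3.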


III A Linear Single-Error Correcting Coding Scheme


Consider the arrangement of filters shown
in Fig. 3-a. A sequence of seven X digits is


fed into a transmitter filter with transfer ratio, 
T, resulting in a sequence Z = (T)X which is 
transmitted through the noisy channel. In the 
channel a noise sequence, N, is added to Z so that 
what arrives at the receiver filter is 

Z' = Z + N (9-a)

At the receiver a filter inverse to the transmitter
filter creates from the sequence Z' a
sequence


X' = (T^-1)Z' = (T^-1)[Z + N]
   = (T^-1)[(T)X + N] = X + (T^-1)N (9-b)


If there were no noise in the channel (N = 0),
X = X'. If there is noise present then the
sequence X' contains the sum of the transmitter
input sequence, X, and the response (T^-1)N of the
receiver filter to the noise. If only a single 
noise digit is present the sequence X' contains 

X plus the impulse response of the receiver filter 
superimposed thereon. 

Let us examine the coding and decoding 
mechanism in more detail. The first four digits 
of the sequence X are information digits, and 
may therefore be chosen in 2^4 = 16 different ways.
The remaining three digits are always all zeros 
and are to be called here buffer digits. The 
composite block of seven digits is scrambled for 
transmission in the channel by the first filter. 
The sequence X' which results from the unscrambl- 
ing action of the receiver filter would equal X 
if there were no noise in the channel. The clue 
to this possibility would be the existence of 
three zeros in the buffer positions (the last 
three digits) in the sequence X'. 

When a single noise digit is equal to unity 
(just one transmitted digit is changed by noise 
action) the received sequence, X', may look quite 
different from X. (See Fig. 3-b.) In particular 
the buffer positions will no longer contain all
zeros, but will instead be three successive digits
of the impulse response of the receiver filter.
In our example the impulse response of that filter
is given in Fig. 3-d, and since we have assumed
the noise impulse to occur in the third
position of the block of seven digits we observe
in the buffer positions the third, fourth, and
fifth digits of the impulse response. (See
Fig. 3-c,d.)

It is extremely important to notice that the 
digits in the buffer positions of the sequence X' 
are independent of which of the sixteen possible 
X sequences is sent. This pattern of digits 
depends only upon the position(s) of the noise 
digits and upon the impulse response of the
receiver filter.

We have chosen the receiver filter so that 
its impulse response has a period of seven digits 
(the length of the composite block) and so that 
each of the seven possible combinations of three
(the number of buffer digits and the degree of
the denominator polynomial) successive digits in
the response will be different from the others.
That this is possible for a block length of
n = 2^b - 1 with b buffer positions follows from
the fact that the maximum possible period of the
impulse response of a filter with denominator
polynomial of degree b is 2^b - 1. (See reference 1.)

By observing the three buffer positions of 
the sequence X', and by knowing the form of the 
impulse response of the receiver filter we can de- 
duce where the noise impulse occurred in the 
block. If we assume that only a single noise
impulse was present (the most likely situation) we
can recreate the original sequence X by adding
(same as subtracting, modulo-two) the now known
sequence (T^-1)N to the sequence X'.

For our example it is interesting to examine 
the sixteen possible sequences, Z, which corres- 
pond to the sixteen possible sequences, X, which 
might be inserted into the transmitter filter with 


T = D^3 + D^2 + I. These are listed in Fig. 4.
The sixteen Z sequences are mutually separated
by a distance of at least three, a necessary
condition for single-error correction(2).

The advantage of the linear circuit viewpoint
of this paper is that instead of concerning
ourselves with the distance properties of
2^k = 2^(n-b) (in our example, 2^4 = 16) different code
message sequences Z = (T)X, we may concentrate
our attention on the impulse response of the
receiver filter with transfer ratio T^-1. It is
not claimed that this latter viewpoint will 
ultimately be more advantageous than the first, 
but only that two viewpoints are better than one. 
For single-error correction in a block
of length n containing b buffer positions and
k = n - b information positions we need only have
a receiver filter with an impulse response with
period of length n with each b successive digits
in that response different from each other
sub-sequence of length b. This is possible for the
case n = 2^b - 1 and the proper polynomial is one
of degree b which has a maximal-length "null
sequence"(1) of 2^b - 1 digits. Several of these
are listed in Fig. 5. 
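
The single-error correcting scheme of this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of the hardware: the transmitter ratio is taken as T = D^3 + D^2 + I, digits are processed first-digit-first from the rest state, and the names `transmit`, `receive`, `syndrome_to_position` and `decode` are hypothetical.

```python
from itertools import product

N, K = 7, 4   # block length and number of information digits


def transmit(x):
    """Z = (D^3 + D^2 + I) X, modulo two."""
    return [x[n] ^ (x[n - 2] if n >= 2 else 0) ^ (x[n - 3] if n >= 3 else 0)
            for n in range(len(x))]


def receive(z):
    """X' = (T^-1) Z': x'[n] = z[n] + x'[n-2] + x'[n-3], modulo two."""
    x = []
    for n in range(len(z)):
        x.append(z[n] ^ (x[n - 2] if n >= 2 else 0) ^ (x[n - 3] if n >= 3 else 0))
    return x


# Buffer-digit pattern left by a single noise digit in each block position.
syndrome_to_position = {}
for pos in range(N):
    noise = [0] * N
    noise[pos] = 1
    syndrome_to_position[tuple(receive(noise)[K:])] = pos


def decode(z_received):
    """Unscramble, read the buffer digits, and remove a single error if present."""
    x = receive(z_received)
    buffer_digits = tuple(x[K:])
    if any(buffer_digits):
        noise = [0] * N
        noise[syndrome_to_position[buffer_digits]] = 1
        x = [a ^ b for a, b in zip(x, receive(noise))]
    return x[:K]


# The sixteen transmitted sequences and their minimum mutual distance.
codewords = [transmit(list(info) + [0] * (N - K))
             for info in product([0, 1], repeat=K)]
min_distance = min(sum(a != b for a, b in zip(u, v))
                   for i, u in enumerate(codewords) for v in codewords[i + 1:])
```

For X = (1110)000 with a noise digit in the third position, `receive` returns 1100111: the buffer positions hold 111, the third, fourth and fifth digits of the receiver impulse response.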


IV A Multiple-Error Correcting Coding Scheme


We now consider the coding of a block of 
seven digits, two of which are information digits 
and five of which are buffer digits. We shall 
use for the transmitter filter one having a transfer
ratio T = (D^2 + D + I)/(D + I) and for the
receiver filter one with the inverse ratio
T^-1 = (D + I)/(D^2 + D + I). To test the properties
of this set of filters in the detection and
correction of errors we shall be interested only 
in the last five digits (in the block of seven 
digits) of the receiver filter response to noise. 
Only if a noise pattern causes a distinctive sub- 
sequence of five digits to appear in the buffer 
positions of the X' sequence can this noise pattern
be recognized and corrected.

The impulse response of the receiver filter 
is given in Fig. 6-a. In Fig. 6-b are listed the 
seven possible responses of this filter to single 
noise digits (single errors). Note that the 
last five digits of these patterns are all dif- 
ferent and that these noise patterns may there- 
fore be detected and corrected. 

In Fig. 6-c are listed the twenty-one 
possible double error patterns and the responses 
they produce in the receiver filter. (These may 
be found by adding, modulo-two, the proper single- 
error responses.) Three of the responses are, in 
their final five digits, the same as those pro- 
duced by certain single errors. Therefore when 
one of these (starred) sub-sequences is received 
in the buffer positions of X' it will be inter- 
preted as due to a single, rather than a double, 
error since the former is more likely than the 
latter. 
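
The single-error and double-error responses described above can be tabulated with a short simulation. This is a sketch under the assumption that the receiver filter (D + I)/(D^2 + D + I) starts from rest and that block positions are numbered 0 to 6 in order of transmission; `receiver`, `buffer_pattern`, `single`, `double` and `masked` are hypothetical names.

```python
from itertools import combinations

N, B = 7, 5   # block length and number of buffer digits


def receiver(z):
    """X' = (D + I)/(D^2 + D + I) Z': x'[n] = x'[n-1] + x'[n-2] + z[n] + z[n-1]."""
    x = []
    for n in range(len(z)):
        bit = z[n] ^ (z[n - 1] if n >= 1 else 0)
        bit ^= (x[n - 1] if n >= 1 else 0) ^ (x[n - 2] if n >= 2 else 0)
        x.append(bit)
    return x


def buffer_pattern(error_positions):
    """Last five digits of the receiver response to the given noise positions."""
    noise = [1 if i in error_positions else 0 for i in range(N)]
    return tuple(receiver(noise)[N - B:])


# Buffer patterns for the seven single errors (all different).
single = {buffer_pattern({i}): i for i in range(N)}

# Buffer patterns for the twenty-one double errors.
double = {pair: buffer_pattern(set(pair)) for pair in combinations(range(N), 2)}

# Double errors whose buffer pattern coincides with that of some single error.
masked = [pair for pair, pattern in double.items() if pattern in single]
```

Under these assumptions `masked` contains exactly three pairs, those confined to the first three positions of the block, and the remaining eighteen double-error patterns are distinct from one another and from the single-error patterns.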

Of the thirty-one sequences of five digits 
possible (excluding the all-zero combination, 
which is interpreted as "no error") six of them 
cannot occur due to single or double errors. Those 


six are listed in Fig. 6-d along with the possible 
pairs of triple-errors which could cause them. 
Whenever one of these six sub-sequences is re- 
ceived in the buffer positions of X' two equally 
probable noise sequences (each containing three 
ones) are the possible, and the most likely, causes.
Therefore in decoding it makes no difference
which of these two we assume for the error pattern.

The net result of the use of the filters de- 
scribed above is that thirty-two sub-sequences 
are possible in the last five digits of the 
sequence X'. These correspond to no error and to
the thirty-one single and multiple-error patterns
which are listed in Fig. 7-a. These thirty-one
sequences constitute the "sphere" which surrounds
the transmitted sequence (Z) = (0000000). The
other three transmitted Z sequences are given in
Fig. 7-b. These were determined by scrambling
the other three X sequences in the transmitter
filter with transfer ratio T = (D^2 + D + I)/(D + I).
The "spheres" surrounding each of the remaining
three Z sequences can be found by adding to each
such sequence the thirty-one non-zero sequences
of Fig. 7-a.

The error probability associated with the 
coding method above is easily calculated. Let 
p be the error probability for a single digit in 
the channel, and q = 1 - p. Then since in the
sphere surrounding the transmitted message
sequence the number of points at distance one is
seven, the number at distance two is eighteen, and
the number at distance three is six, the
probability that a block is decoded correctly is
q^7 + 7q^6 p + 18q^5 p^2 + 6q^4 p^3, and the error
probability is


P_e = 1 - (q^7 + 7q^6 p + 18q^5 p^2 + 6q^4 p^3) (10)


Slepian(3) has shown that this is the minimum
possible error probability for k = 2, n = 7.
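
The counts seven, eighteen and six, and the resulting probability, can be reproduced numerically. The sketch below assumes the receiver filter (D + I)/(D^2 + D + I) started from rest and assigns to each of the thirty-two buffer patterns the lightest noise pattern that produces it; the names `receiver`, `correctable_weights` and `block_error_probability` are hypothetical. The sum q^7 + 7q^6 p + 18q^5 p^2 + 6q^4 p^3 is the probability that the noise is one of the correctable patterns, so the block error probability is taken here as one minus that sum.

```python
from collections import Counter
from itertools import combinations

N, B = 7, 5   # block length and number of buffer digits


def receiver(z):
    """x'[n] = x'[n-1] + x'[n-2] + z[n] + z[n-1], modulo two."""
    x = []
    for n in range(len(z)):
        bit = z[n] ^ (z[n - 1] if n >= 1 else 0)
        bit ^= (x[n - 1] if n >= 1 else 0) ^ (x[n - 2] if n >= 2 else 0)
        x.append(bit)
    return x


def correctable_weights():
    """Number of correctable noise patterns of each weight (one per buffer pattern)."""
    lightest = {}
    for weight in range(N + 1):                      # lightest patterns are seen first
        for positions in combinations(range(N), weight):
            noise = [1 if i in positions else 0 for i in range(N)]
            lightest.setdefault(tuple(receiver(noise)[N - B:]), weight)
    return Counter(lightest.values())


def block_error_probability(p):
    """One minus the probability that the noise pattern is a correctable one."""
    q = 1 - p
    counts = correctable_weights()
    p_correct = sum(c * q ** (N - w) * p ** w for w, c in counts.items())
    return 1 - p_correct


counts = correctable_weights()
```

As a quick sanity check, the expression gives zero error probability at p = 0, and at p = 1/2 it gives 1 - 32/128 = 0.75, since all 128 noise patterns are then equally likely and 32 of them are corrected.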


V Summary


In Fig. 8 are listed the transfer ratios for 
transmitter filters which yield the minimum error 
probabilities listed by Slepian(3). These
expressions have been arrived at by a variety of 
methods and are not necessarily those which re- 
quire a minimum number of shift-registers for 
realization. Often alternate filters exist which 
use the same number of shift-registers for their 
realization and which code for the same error-
probability as those which are listed. For
instance, for k = 4 and n = 8, T = (D^4 + I)/(D^4 + D + I)
may be substituted for the T = D^4 + D^3 + I which
is listed in the table. 

The general question of finding the most 
economical filter for minimum possible error- 
probability has not yet been satisfactorily 
solved and the author wishes to defer presentation
of his fragmentary results until such time
as they cohere more satisfactorily. 

It is clear that the method presented here 
can be extended to other than the modulo-two 
number system. For example, if we consider X to 
be a block of four ternary digits, - - - two 
information digits followed by two buffer digits, 
- - - then a transmitter filter with a transfer 
ratio


Z/X = D^2 + 2D + 2 (11-a)


will encode the block for transmission in a
noisy channel. Keeping in mind that the +
(modulo-three addition) operation is no longer
self-inverse and that 1 + 2 = 2 + 1 = 0, it
follows that


Z = D^2 X + 2DX + 2X (11-b)

and

X + 2Z + Z = X + 2Z + D^2 X + 2DX + 2X (11-c)

and

X = D^2 X + 2DX + 2Z (11-d)


From these expressions it may be found that the 
impulse response of the transmitter filter is 


- - - 0 0 0 0 2 2 1 0 0 0 0 0 0 0 0 - - - (12-a)

and that of the receiver filter is

- - - 0 0 0 0 2 1 1 0 1 2 2 0 2 1 1 0 1 2 2 0 2 1 1 0 1 - - - (12-b)


The nine encoded sequences, Z, in Fig. 9-a
may then be derived and the noise patterns due
to errors in single positions may be calculated
to be those of Fig. 9-b. Note that the last
two (buffer) positions can contain one of the
eight possible (non-zero) combinations of two
ternary digits corresponding to the eight possible
kinds of single errors which may be detected and
corrected. That this is possible is due to the
fact that the null sequence associated with
D^2 + 2D + 2 is of maximal length: 3^2 - 1 = 8.
In this case, again, we have chosen to concen- 
trate our attention on the impulse response of 
the receiver filter rather than on the somewhat 
more elusive "distance" properties of the encod- 
ed messages. 
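
The ternary example can be checked in the same way as the binary ones. The following sketch assumes a block of four ternary digits (two information, two buffer), filters started from rest, and the recurrences Z = D^2 X + 2DX + 2X and X = D^2 X + 2DX + 2Z taken modulo three; the names `transmit3`, `receive3`, `patterns` and `decode3` are hypothetical.

```python
def transmit3(x):
    """Z = (D^2 + 2D + 2) X, modulo three."""
    return [((x[n - 2] if n >= 2 else 0)
             + 2 * (x[n - 1] if n >= 1 else 0)
             + 2 * x[n]) % 3
            for n in range(len(x))]


def receive3(z):
    """X = D^2 X + 2DX + 2Z, modulo three."""
    x = []
    for n in range(len(z)):
        x.append(((x[n - 2] if n >= 2 else 0)
                  + 2 * (x[n - 1] if n >= 1 else 0)
                  + 2 * z[n]) % 3)
    return x


# Buffer digits produced by each of the eight kinds of single error
# (four positions, error value 1 or 2).
patterns = {}
for pos in range(4):
    for val in (1, 2):
        noise = [0] * 4
        noise[pos] = val
        patterns[tuple(receive3(noise)[2:])] = (pos, val)


def decode3(z_received):
    """Unscramble and, if the buffer digits are non-zero, remove a single error."""
    x = receive3(z_received)
    buffer_digits = tuple(x[2:])
    if buffer_digits != (0, 0):
        pos, val = patterns[buffer_digits]
        noise = [0] * 4
        noise[pos] = val
        x = [(a - b) % 3 for a, b in zip(x, receive3(noise))]
    return x[:2]
```

Since modulo-three addition is not self-inverse, the correction step subtracts the known noise response instead of adding it.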

In summary, the viewpoint introduced here
suggests that, instead of thinking about the 
distance properties of message points in an 
n-dimensional space, we may profitably think of 
designing a linear binary sequence filter at the 
receiver whose impulse response is of such a form 
that, by viewing b = n - k successive digits of
it we distinguish sub-sequences due to single 
errors, by viewing b digits of two superimposed 
impulse responses we may distinguish sub-sequences 
due to double errors, etc. (Regardless of how 
complicated the desired impulse response may be 
we are certain that there exists a corresponding 
linear sequence filter.) The corresponding mes- 
sages encoded by the transmitter filter have the 
property that, when they are fed into the
receiver filter in its "rest" state, they leave
the filter in its "rest" state at the end of the 
message sequence. Finally, we might propose the 
question: For a given n and k, and for a fixed 
number of shift-registers what transfer ratio 
should the receiver filter have to minimize the 
associated error probability? 


References:

(1) D. A. Huffman, "The Synthesis of Linear
Sequential Coding Networks", Proceedings of
the Third London Symposium on Information
Theory, September 13, 1955.

(2) R. W. Hamming, "Error Detecting and Error
Correcting Codes", Bell System Technical
Journal, pp. 147-160, 1950.

(3) D. Slepian, "A Class of Binary Signaling
Alphabets", Bell System Technical Journal,
pp. 203-234, 1956.


Fig. 1 - Chain realization of a binary sequence filter and its inverse.


Fig. 2 - Steps in the synthesis of a binary filter from a specified
impulse response.


Fig. 3 - An elementary example of the linear single-error detecting scheme.
(a) Transmitter filter, T = D^3 + D^2 + I; channel with added noise, N; receiver filter.
(b)-(c) Example with X = (1110)000 and N = 0010000.
(d) Impulse response of the receiver filter with transfer ratio T^-1.


Fig. 4 - Coded sequences for single-error correction (n = 7).


Fig. 5 - Polynomials having maximal length null sequences.


(a) Impulse response of filter
(b) Response of filter to single errors
(c) Response of filter to double errors
(d) Response of filter to certain triple errors

Fig. 6 - Error response of (D + I)/(D^2 + D + I).


(a) Shape of "sphere" surrounding transmitted sequences

No error:
0000000

Single errors:
1000000  0100000  0010000  0001000  0000100  0000010  0000001

Double errors:
1001000  1000100  1000010  1000001
0101000  0100100  0100010  0100001
0011000  0010100  0010010  0010001
0001100  0001010  0001001  0000110
0000101  0000011

Triple errors:
0010110 (or 1001001)
0010101 (or 1001010)
0010011 (or 1001100)
0011100 (or 1000011)
0011001 (or 1000110)
0011010 (or 1000101)

(b) Derivation of the four transmitted sequences

X: 0000000   Z: 0000000
X: 0100000   Z: 0101111
X: 1000000   Z: 1011111
X: 1100000   Z: 1110000

Fig. 7 - Results of coding with binary filter for k = 2, n = 7.


Fig. 8 - Transmitter filters for minimum error-probability.


(a) Derivation of sequences to be transmitted.
(b) Response of receiver filter due to errors in single positions within block.

Fig. 9 - Encoded sequences and error patterns for a ternary linear
coding scheme; T = D^2 + 2D + 2.


THEORY OF INFORMATION FEEDBACK SYSTEMS 


Sheldon S. L. Chang 
New York University 
University Heights 
New York 53, N. Y. 


Abstract 

A general information feedback system is de- 
fined and formulated in a way broad enough to al- 
low coded or uncoded channels with total or par- 
tial information feedback. Basic theorems gov- 
erning change in information rate and reliability 
are derived with full consideration of the tran- 
sition probabilities of both direct and feedback 
channels, including message words as well as the 
confirmation - denial signal.


The process involved in this type of feedback
is as follows. There is a set of {x} words
which are transmitted and a set of {y} words
received. The set {x} is divided, as previously
agreed upon, into {X} groups, such that x_ij is
the jth member of the ith group. Corresponding
to the transmission of x_ij, the word y_i'j'k' may
be received. The reception of this word indicates
that x_i'j' ∈ X_i' was transmitted. The receiver
therefore sends back to the transmitter
the message Y_i' to indicate the belief that a
message from the group X_i' was sent. Due to noise
in the feedback channel the message Z_i'' is received.
If Z_i'' corresponds to X_i, the sender
confirms the report by sending the subsequent x.
If Z_i'' does not correspond to X_i, the sender
transmits the denial signal. It is assumed that
while the feedback channel is reporting, the direct
channel is sending new information simultaneously.
An example is given which shows how
this is accomplished with delayed feedback.


The following theorem and corollary 
are derived: 


Theorem I. The gain in information at the 
receiver by the confirmation - denial pro- 
cess is less than the average signal entropy 
required for the confirmation - denial 
transmission by 


Σ_{i,i'} P(X_i, Y_i') H_ii'


where P(X_i, Y_i') is the joint probability
that a word from the X_i group is transmitted
and a word from the Y_i' group is received,
and


H_ii' = - P_ii' log P_ii' - (1 - P_ii') log (1 - P_ii')


is the entropy associated with the probability
P_ii' of the feedback channel. This
is the probability that when a word from
the X_i group is sent and a word from the
Y_i' group is received, the receiver will
obtain confirmation through the feedback
process. H_ii' is therefore the uncertainty
of confirmation or denial when a word from
the X_i group is sent and a word from the
Y_i' group is received.


Corollary I. The gain in information at the 
receiver due to the reduction in equivo- 
cation by the confirmation - denial process 
is equal to the net information of the 
confirmation - denial signal when the feed- 
back channel is error-free. 


A special case of information feedback 
systems is the discarding system in which a wrong 
message is corrected by erasing and then re- 
peating. If the feedback message indicates that 
an erasure was incorrectly received as a message
word, the sender transmits two erasures to
erase both the message word and the preceding in- 
correct message. When the feedback message in- 
dicates that a message has been incorrectly re- 
ceived as an erasure, the sender transmits, not 
another erasure, but simply the preceding mes- 
sage and the intended message. 


In a discarding system there is a natural 
iterative process which greatly reduces the error 
probabilities of the confirmation - denial signal. 
The information that a particular message is con- 
firmed or denied is not entirely contained in the 
absence or presence of an erasure signal. Each 
succeeding message is a confirmation of a pre- 
vious choice. 


Theorem II. In an iterative discarding
system, the error probabilities of the confirmation -
denial signal are successively
reduced by the iterative process. If the 
feedback channel is error free, and if the 
feedback group containing the erasure signal 
contains it as the one and only signal, the 
error probabilities decay to zero at ap- 
proximately an exponential rate with each 
subsequent confirmed signal. Otherwise they 
converge rapidly to constants. 


A third theorem deals with the equivocation 
of the finally accepted signal. 


Some deductions from the basic theorems are 
listed below: 


(1) With error-free feedback of sufficient 
capacity to repeat the whole message as it is 
transmitted, information can be sent error-free 
through a noisy channel with little or no loss of 
the net information rate and without coding. 


(2) With an equally noisy feedback channel, 
feedback is by far more effective than coding in 
error reduction. By effective is meant that
there is no need of elaborate coding nor serious 
loss of channel capacity. 




Classification of Feedback Systems 


The utilization of feedback in a two way 
communication system to improve message relia- 
bility is a natural process which occurs at least 
millions of times daily. During a telephone con- 
versation, while A is talking to B, B may ask A to 
repeat upon hearing an indistinguishable word, or 
A may ask B to repeat if the message is important 
or without redundancy and must be reliably transmitted.
Aside from feedback relating to the immediate
word or sentence, A may talk at a slower
rate if he is asked to repeat too often or if B 
makes mistakes too frequently in recital. Alter- 
nately B may ask A to talk slower if what he 
hears is not clear enough. 


The above natural processes are not effi- 
cient in terms of information rate or improvement 
of reliability. However, they embody the essen- 
tial principles upon which efficient feedback com- 
munication systems may be devised. 


There are two distinct functions of the
feedback process, as illustrated above: 


(i) improvement of the immediate message


(ii) matching of the communication rate 
to the noise level. 


To devise means for fulfilling the second function
is essentially an instrumentation problem. Its
very existence depends on the assumption that the
signal to noise ratio is a slowly varying function.
Consequently, it is allowable to assume
that within any given interval, which may contain
many word durations, both the signal to noise
ratio and the communication rate do not change
appreciably. Once the basic laws governing the
information rate and reliability under constant
signal to noise ratio and communication rate are
established, it is always possible to design a
system which automatically adjusts for the optimum
condition.


As a preliminary study this paper will be 
concerned with the more basic problem, that of 
improvement of the immediate message by means of
feedback, while signal to noise ratio and com- 
munication rate are held constant. 


Generally speaking, there are two types of 
feedback systems which improve the immediate mes- 
sage: 


(1) decision feedback 
(2) information feedback. 


In case (1), the sender adds redundancy or 
coding into the message so that out of a total of
N possible sequences, only M are selected as 
alphabets for conveying information and are trans- 
mitted. Of the remaining N-M sequences, there may 
be M' sequences such that each of the M' is close 
enough to one of the M and far away from the other 
M-1 in signal space that it can be interpreted as 


the former without much risk of being in error. 
Upon reception of one of M or M', the receiver 
will record it and report "+" which means
"please proceed". If one of the N-(M+M') is received,
the receiver will not record and report
"-" which means "please repeat", and the transmitter
will then repeat the information.


There is no need for the transmitter to wait 
while receiving the report from the receiver. It 
may transmit subsequent information at the same 
time if, for instance, the agreed arrangement is 
that at a "-" report the next two sequences are
subsequent informations while the third is the re- 
quested repetition, or similar arrangements. 


For this system to function well, the feed- 
back channel must be able to transmit reliably one 
bit of information during the interval that the 
direct channel transmits one sequence. Alternatively,
the feedback channel may be required
to report the positions of received sequences
belonging to the N-(M+M') after each group of
sequences instead of reporting "+" or "-" after each
sequence and the required feedback channel
capacity would be even smaller.


The essential problem in this type of system 
is to code the direct transmission channel into 
codes which allow for nulls or rejected infor- 
mation.* The feedback is simply a means of 
filling the null positions with additional infor- 
mation from the direct transmission channel. If 
the feedback channel has negligible error, which is 
not difficult to obtain as the required channel 
capacity is low, the information rate and the 
probability of error of the direct transmission 
channel will remain unaltered. 


Hereafter case (2), information feedback,
will be considered. 


I. Information Feedback, Definition, General Theorem


In an information feedback system, the receiver
reports back in whole or in part the received
information and the sender will decide
whether or not he is satisfied with the infor- 
mation as received, and in the latter event, he 
will send corrective information. 


In calculating channel capacity, it is assumed
as in case (1) that while the feedback channel
is reporting, the direct transmission channel
is sending new information simultaneously so that
no standby period will be allowed for.**


The basic process involved in this type of
feedback is as follows: There is a set of {x}
words which are transmitted and a set of {y}
words received. The set of {x} is divided, as
previously agreed upon, into {X} groups, such that
x_ij is the jth member of the ith group. Corresponding
to the transmission of x_ij, the word
y_i'j'k' may be received. The reception of this
word indicates that x_i'j' ∈ X_i' was transmitted.
The receiver therefore sends back to the transmitter
the message Y_i' to indicate the belief
that a message from the group X_i' was sent. Due
to noise in the feedback channel the message Z_i''
is received. If Z_i'' corresponds to X_i, the
sender confirms the report. If Z_i'' does not
correspond to X_i, the sender transmits the denial
signal.


* For instance, Hamming's double error detecting,
single error correcting code, and others as given
in reference 2.
** This point will be illustrated later in section


The above processes have the following 
physical significance. 


1. There is an improvement in reliability,
since only confirmed messages are finally retained.


2. There are two opposite changes in information
rate. The sender has to provide additional
capacity to keep the receiver informed as
to whether the reported message is correct.


However, there is additional information 
gained by the receiver in knowing whether or not 
the original message is in the reported group. 


3. The feedback channel may have consider- 
ably lower capacity than the direct transmission 
channel since each group may contain many words, 
yet the improvement in reliability can still be 
substantially realized. It is generally possible 
to assign the transmitted words in such a way 
that words in the same group are far away from 
each other in the signal space. Once confirmed, 
there is little probability of error, as the 
other words in the same group are very unlikely 
to be received by mistake. 


In the following, a quantitative study will 
be made on the above mentioned effects. 


After receiving $y_{i'j'k'}$, the a posteriori probability that the
transmitted word was $x_{ij}$ is $P_{y_{i'j'k'}}(x_{ij})$. After the
receiver reports $Y_{i'}$, the conditional probability that the
sender will receive $Z_{i''}$ is $P_{Y_{i'}}(Z_{i''})$. When
$Z_{i''}$ corresponds to $X_i$ and the sender confirms the report,
there is a probability $P_c(d)$ that the confirmation is received as
a denial. Similarly, when $Z_{i''}$ does not correspond to $X_i$,
there is a probability $P_d(c)$ that the denial is taken as a
confirmation. There are, of course, the probabilities $P_c(c)$ and
$P_d(d)$ that the confirmation and denial signals are correctly
received. From these definitions

$$P_c(c) + P_c(d) = 1, \qquad P_d(c) + P_d(d) = 1 \qquad (1)$$

and

$$\sum_{i''} P_{Y_{i'}}(Z_{i''}) = 1 \qquad (2)$$


There are various ways of transmitting 
the confirmation denial signal. One method of 
special interest is for the sender to transmit 
the denial signal only. When he transmits the 
subsequent message, confirmation of the previous 
report from the receiver is automatically implied.
While this particular method will be
discussed in detail in subsequent sections, the 
discussion here is kept at a general level, 
without any reference as to how the confirmation 
denial information is conveyed. 


Upon receiving $y_{i'j'k'}$, the probability that $x_{ij}$ was sent
and a report of $Y_{i'}$ will be confirmed is

$$P_{y_{i'j'k'}}(x_{ij})\, P_{ii'} \qquad (3)$$

where $P_{ii'}$ is defined as

$$P_{ii'} = P_{Y_{i'}}(Z_i)\, P_c(c) + [1 - P_{Y_{i'}}(Z_i)]\, P_d(c) \qquad (4)$$

and the feedback messages $Z_i$ correspond to $X_i$. $P_{ii'}$ is the
conditional probability that the feedback word $Y_{i'}$ will result
in confirmation, when $x_{ij} \in X_i$ was originally transmitted.


Upon receiving $y_{i'j'k'}$, the probability that $x_{ij}$ was sent
and a report of $Y_{i'}$ will be denied is:

$$P_{y_{i'j'k'}}(x_{ij})\, (1 - P_{ii'}) \qquad (5)$$

The total probability of confirmation upon receiving $y_{i'j'k'}$ is
therefore

$$P_{y_{i'j'k'}}(c) = \sum_i \sum_j P_{ii'}\, P_{y_{i'j'k'}}(x_{ij}) \qquad (6)$$

while the total probability of denial is

$$P_{y_{i'j'k'}}(d) = \sum_i \sum_j (1 - P_{ii'})\, P_{y_{i'j'k'}}(x_{ij}) \qquad (7)$$


The conditional probability, upon receiving both $y_{i'j'k'}$ and
the confirmation of $Y_{i'}$, that $x_{ij}$ was originally sent is

$$P_{c\,y_{i'j'k'}}(x_{ij}) = \frac{P_{ii'}\, P_{y_{i'j'k'}}(x_{ij})}{P_{y_{i'j'k'}}(c)} \qquad (8)$$

and similarly, the conditional probability, upon receiving both
$y_{i'j'k'}$ and the denial of $Y_{i'}$, that $x_{ij}$ was sent is

$$P_{d\,y_{i'j'k'}}(x_{ij}) = \frac{(1 - P_{ii'})\, P_{y_{i'j'k'}}(x_{ij})}{P_{y_{i'j'k'}}(d)} \qquad (9)$$


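The equivocation after receiving both $y_{i'j'k'}$ and subsequent
confirmation is therefore:

$$H_{c\,y_{i'j'k'}}(X) = -\sum_i \sum_j P_{c\,y_{i'j'k'}}(x_{ij}) \log P_{c\,y_{i'j'k'}}(x_{ij}) \qquad (10)$$

By substitution and performing the summation where possible,

$$H_{c\,y_{i'j'k'}}(X) = \log P_{y_{i'j'k'}}(c) - \sum_i \sum_j P_{c\,y_{i'j'k'}}(x_{ij}) \left\{ \log P_{ii'} + \log P_{y_{i'j'k'}}(x_{ij}) \right\} \qquad (11)$$

Similarly,

$$H_{d\,y_{i'j'k'}}(X) = \log P_{y_{i'j'k'}}(d) - \sum_i \sum_j P_{d\,y_{i'j'k'}}(x_{ij}) \left\{ \log (1 - P_{ii'}) + \log P_{y_{i'j'k'}}(x_{ij}) \right\} \qquad (12)$$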


On the average, the total equivocation is reduced by the
confirmation-denial process by the amount

$$\Delta I_{y_{i'j'k'}} = H_{y_{i'j'k'}}(X) - P_{y_{i'j'k'}}(c)\, H_{c\,y_{i'j'k'}}(X) - P_{y_{i'j'k'}}(d)\, H_{d\,y_{i'j'k'}}(X) \qquad (13)$$

whence, by substitution, and noting that by definition the
equivocation is

$$H_{y_{i'j'k'}}(X) = -\sum_i \sum_j P_{y_{i'j'k'}}(x_{ij}) \log P_{y_{i'j'k'}}(x_{ij}) \qquad (14)$$

the result is obtained that

$$\Delta I_{y_{i'j'k'}} = -P_{y_{i'j'k'}}(c) \log P_{y_{i'j'k'}}(c) - P_{y_{i'j'k'}}(d) \log P_{y_{i'j'k'}}(d) + \sum_i \sum_j P_{y_{i'j'k'}}(x_{ij}) \left\{ P_{ii'} \log P_{ii'} + (1 - P_{ii'}) \log (1 - P_{ii'}) \right\} \qquad (15)$$


The signal entropy for the confirmation-denial process is

$$H_{y_{i'j'k'}}(cd) = -P_{y_{i'j'k'}}(c) \log P_{y_{i'j'k'}}(c) - P_{y_{i'j'k'}}(d) \log P_{y_{i'j'k'}}(d) \qquad (16)$$

while an uncertainty entropy $H_{ii'}$ may be defined as

$$H_{ii'} = -P_{ii'} \log P_{ii'} - (1 - P_{ii'}) \log (1 - P_{ii'}) \qquad (17)$$

The reduction in the total entropy of the received message by the
confirmation-denial process is then concisely given as

$$\Delta I_{y_{i'j'k'}} = H_{y_{i'j'k'}}(cd) - \sum_i \sum_j P_{y_{i'j'k'}}(x_{ij})\, H_{ii'} \qquad (18)$$
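Equations 6 through 18 can be checked numerically for a single received word. The following Python sketch is an illustration only: the posterior values and the confirmation probabilities are arbitrary numbers chosen here, and the function names are hypothetical. It evaluates the gain from equation 13 directly and compares it with the right-hand side of equation 18.

```python
import math
import random

def binary_entropy(p):
    """Entropy of a two-way choice with probabilities p and 1 - p (natural log)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def entropy(probs):
    """Entropy of a discrete distribution (natural log)."""
    return -sum(q * math.log(q) for q in probs if q > 0)

def confirmation_gain(posterior, p_confirm):
    """posterior[i][j] plays the role of P_y(x_ij); p_confirm[i] plays P_ii'.

    Returns (gain from eq. 13, right-hand side of eq. 18).
    """
    flat = [(i, q) for i, row in enumerate(posterior) for q in row]
    p_c = sum(p_confirm[i] * q for i, q in flat)                 # eq. 6
    p_d = 1.0 - p_c                                              # eq. 7
    post_c = [p_confirm[i] * q / p_c for i, q in flat]           # eq. 8
    post_d = [(1 - p_confirm[i]) * q / p_d for i, q in flat]     # eq. 9
    h_y = entropy([q for _, q in flat])                          # eq. 14
    delta_i = h_y - p_c * entropy(post_c) - p_d * entropy(post_d)  # eq. 13
    h_cd = binary_entropy(p_c)                                   # eq. 16
    rhs = h_cd - sum(q * binary_entropy(p_confirm[i]) for i, q in flat)  # eq. 18
    return delta_i, rhs

# Arbitrary illustrative posterior over 4 groups of 3 words each.
random.seed(0)
raw = [[random.random() for _ in range(3)] for _ in range(4)]
total = sum(map(sum, raw))
posterior = [[q / total for q in row] for row in raw]
p_confirm = [0.9, 0.05, 0.1, 0.02]   # assumed values of P_ii' for i = 0..3
delta_i, rhs = confirmation_gain(posterior, p_confirm)
```

With these inputs the two returned values agree to rounding error, which is what equation 18 asserts for any posterior that sums to one.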


This relation shows that the average gain in information by the
confirmation-denial process is less than the entropy of the code
required for the confirmation-denial process. The net difference is

$$H_{y_{i'j'k'}}(cd) - \Delta I_{y_{i'j'k'}} = \sum_i \sum_j P_{y_{i'j'k'}}(x_{ij})\, H_{ii'} \qquad (19)$$


The symbol $\sum_{i'j'k'}$ will indicate a summation or an
integration over the received signals, according to whether the
received signals are discrete or continuous. The probability that
the received signal is $y_{i'j'k'}$ is $P(y_{i'j'k'})$ if $y$ is
discrete or $P(y_{i'j'k'})\,dk'$ if $y$ is continuous. Hence the
difference averaged over all $y_{i'j'k'}$ is

$$H(cd) - \Delta I = \sum_{i'j'k'} \sum_i \sum_j P(y_{i'j'k'})\, P_{y_{i'j'k'}}(x_{ij})\, H_{ii'} \qquad (20)$$

where $H(cd)$ is the averaged entropy of the code required for
confirmation-denial. By combining,

$$H(cd) - \Delta I = \sum_{i'j'k'} \sum_{ij} P(x_{ij}, y_{i'j'k'})\, H_{ii'} \qquad (21)$$

where $P(x_{ij}, y_{i'j'k'})$ is the joint probability that $x_{ij}$
is sent and $y_{i'j'k'}$ is received. Performing the summation over
$j'$, $k'$, and $j$,

$$\sum_{j'k'j} P(x_{ij}, y_{i'j'k'}) = P(X_i, Y_{i'}) \qquad (22)$$

is the joint probability that a word from the $X_i$ group is sent
and a word from the $Y_{i'}$ group is received.


Thus the net difference is

$$\sum_{ii'} P(X_i, Y_{i'})\, H_{ii'} \qquad (23)$$

Restated as a theorem the results are:


Theorem I  The gain in information at the receiver by the
confirmation-denial process is less than the average signal
entropy required for the confirmation-denial transmission by

$$\sum_{ii'} P(X_i, Y_{i'})\, H_{ii'}$$

where $P(X_i, Y_{i'})$ is the joint probability that a word from the
$X_i$ group is transmitted and a word from the $Y_{i'}$ group is
received, and

$$H_{ii'} = -P_{ii'} \log P_{ii'} - (1 - P_{ii'}) \log (1 - P_{ii'})$$

is the entropy associated with the probability $P_{ii'}$ of the
feedback channel. This is the probability that when a word from the
$X_i$ group is sent and a word from the $Y_{i'}$ group is received,
the receiver will obtain confirmation through the feedback process.
$H_{ii'}$ is therefore the uncertainty of confirmation or denial
when a word from the $X_i$ group is sent and a word from the
$Y_{i'}$ group is received.


Two Special Cases  An interesting case is when the feedback channel
is error-free. Formally,

$$P_{Y_{i'}}(Z_i) = \delta_{ii'}$$

and

$$P_{ii'} = \delta_{ii'}\, P_c(c) + (1 - \delta_{ii'})\, P_d(c).$$

Then the uncertainty entropy for $i = i'$ becomes

$$H_{ii} = -P_c(c) \log P_c(c) - [1 - P_c(c)] \log [1 - P_c(c)] \qquad (24)$$

and for $i \neq i'$

$$H_{ii'} = -P_d(d) \log P_d(d) - [1 - P_d(d)] \log [1 - P_d(d)] \qquad (25)$$


The difference between the entropy of the confirmation-denial code
and the gain in information at the receiver is then

$$H(cd) - \Delta I = \sum_i P(X_i, Y_i)\, H_{ii} + \sum_{i \neq i'} P(X_i, Y_{i'})\, H_{ii'} \qquad (26)$$

But $P(X_i, Y_i)$ is the probability, when the feedback channel is
error-free, that a confirmation will be subsequently transmitted,
while $P(X_i, Y_{i'})$ with $i \neq i'$ is the probability of a
denial being sent. Hence

$$\sum_i P(X_i, Y_i) = P(c), \qquad \sum_{i \neq i'} P(X_i, Y_{i'}) = P(d) \qquad (27)$$

and the difference between $H(cd)$ and $\Delta I$ is the
equivocation of the direct channel to the transmission of the
confirmation-denial code.


Corollary I  The gain in information at the
receiver due to the reduction in equivo- 
cation by the confirmation - denial process 
is equal to the net information of the 
confirmation - denial signal when the feed- 
back channel is error-free. 


Another interesting case is that the confirmation-denial process
may be assumed error-free. This is expressed mathematically as:

$$P_d(c) = P_c(d) = 0, \qquad P_d(d) = P_c(c) = 1$$

and $P_{ii'}$ reduces to the simple relation

$$P_{ii'} = P_{Y_{i'}}(Z_i) \qquad (28)$$

Stated as a corollary, this result is:


Corollary II  Even with an error-free confirmation-denial process,
noise in the feedback channel results in a net loss in the direct
channel information rate. The loss is given by

$$\text{loss} = \sum_{ii'} P(X_i, Y_{i'})\, H_{ii'}$$

where

$$H_{ii'} = -P_{Y_{i'}}(Z_i) \log P_{Y_{i'}}(Z_i) - [1 - P_{Y_{i'}}(Z_i)] \log [1 - P_{Y_{i'}}(Z_i)]$$

is the equivocation of the feedback channel.


Approximate Equations for Change in Information Rate


While $H_{ii'}$ can be evaluated for special systems, its evaluation
in general is by no means simple. Upper bounds for the difference
between the entropy of the confirmation-denial signal and the gain
in information may be more easily determined. This difference is
greatest for the largest $H_{ii'}$. Hence, from equation 23,

$$[H(cd) - \Delta I] < \sum_{ii'} P(X_i, Y_{i'})\, H_{ii'}(\max)$$

or

$$[H(cd) - \Delta I] < H_{ii'}(\max) \qquad (29)$$

since

$$\sum_{ii'} P(X_i, Y_{i'}) = 1$$


A tighter upper bound may be determined by the following
consideration. For the feedback to be at all effective, the
probability of obtaining confirmation, when a word from the $X_i$
group is sent and a word from the corresponding $Y_i$ group is
received, must be better than one-half:

$$P_{ii} > 1/2$$

or correspondingly

$$P_{ii'} < 1/2 \quad \text{when } i \neq i'.$$

Therefore the maximum value of $H_{ii'}$ occurs for some $i = i'$. A
more refined upper bound for the loss in information is thus

$$\text{Loss} < \sum_i P(X_i)\, H_{ii} \qquad (30)$$

or

$$\text{Loss} < \sum_{i'} P(Y_{i'})\, H_{i'i'}$$

where $P(X_i)$ is the probability of transmitting a word from the
$X_i$ group, and $P(Y_{i'})$ is the probability of receiving a word
in the $Y_{i'}$ group. The relation that gives the least upper bound
should be used.


II Discarding Systems 


A discarding system is a special case of 
information feedback systems. It is charac- 
terized by the following: 


1. The denial signal or erasure, $\phi$, is transmitted as one of
$\{x\}$ and received as the corresponding member (or members) of
$\{y\}$.


2. Confirmation is implied with the trans- 
mission of subsequent information. 


3. A denied word is corrected by repeating 
the same word following a denial. The remaining 
information in a denied word is totally dis- 
carded. 


General Relation for the Signal Entropy of the 
Confirmation - Denial Process. 


Consider a code where the received signals may be any one of the
states $y_s$. If the probability of receiving $y_s$ is $P(y_s)$,
then

$$\sum_s P(y_s) = 1$$

Let $\phi$ be the denial or alerting signal. If the probability of
receiving $\phi$ is $P(\phi)$, then the probabilities $P(y_s)$ must
accordingly be reduced in the presence of the confirmation-denial
process to

$$[1 - P(\phi)]\, P(y_s)$$

The required signal entropy is then

$$-P(\phi) \log P(\phi) - \sum_s [1 - P(\phi)]\, P(y_s) \log \{ [1 - P(\phi)]\, P(y_s) \}$$

or

$$-P(\phi) \log P(\phi) - [1 - P(\phi)] \log [1 - P(\phi)] - [1 - P(\phi)] \sum_s P(y_s) \log P(y_s)$$


The last term in this expression represents the required signal
entropy for the transmission of new information when the preceding
message is confirmed. The first two terms therefore represent the
entropy required for the confirmation-denial process. Since
$\phi$ is the denial signal, $P(\phi)$ represents the probability of
denial $P(d)$ and $[1 - P(\phi)]$ represents the probability of
confirmation $P(c)$, so the entropy required for the
confirmation-denial process is

$$-P(c) \log P(c) - P(d) \log P(d).$$
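The split of the signal entropy into a confirmation-denial part and a new-information part can be confirmed numerically. The Python sketch below uses an assumed erasure probability and an assumed word distribution (neither is from the paper), and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution (natural log)."""
    return -sum(q * math.log(q) for q in probs if q > 0)

def split_signal_entropy(p_erasure, word_probs):
    """Return (total entropy, confirmation-denial part, new-information part).

    p_erasure is the probability of receiving the denial signal;
    word_probs are the probabilities P(y_s), summing to one.
    """
    scaled = [(1 - p_erasure) * q for q in word_probs]
    total = entropy([p_erasure] + scaled)
    cd_part = entropy([p_erasure, 1 - p_erasure])
    info_part = (1 - p_erasure) * entropy(word_probs)
    return total, cd_part, info_part

# Assumed illustrative numbers: 10% erasures, four message states.
total, cd_part, info_part = split_signal_entropy(0.1, [0.5, 0.25, 0.125, 0.125])
```

The total equals the sum of the two parts, which is the identity used in the text to isolate the entropy of the confirmation-denial process.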
Pair Creation and Pair Annihilation 


An undesirable by-product of the discarding
system is the non-conservation of the number of 
message words. A pair of message words is created 
by the erroneous transmission of an erasure into 
a message word. Similarly, a pair of message 
words is annihilated by the erroneous trans- 
mission of a message word into an erasure. 


For transmission of information which has 
intrinsic redundancy, such as English language, 
the non-conservation effect is not serious. In 
the received message, the extraneous pair and 
the erroneous erasure are easily detectable. 
However, for messages without redundancy, this 
effect will cause dislocation of subsequent words, 
which may be considered totally erroneous in 
certain applications. In these cases, the transmitted
message should be coded, and feedback may
be used for further improvement of reliability. 


From the probability point of view, the amount of information
contained in a received word is still uniquely defined irrespective
of the problem of non-conservation. Its equivocation
can be calculated from either equation 10
or equation 1) depending on whether the word is 
confirmed or unconfirmed. Its uncertainty of 
being the erasure is the partial equivocation 
pertaining to the confirmation-denial information. For the entire
message, the probabilities of its being a certain probable
transmitted message are different before and after the
transmission. The amount of transmitted information is uniquely
defined even though the received message may not have exactly the
same length as the transmitted message.


However, for counting the number of errors in a received message,
some artificial rule is necessary. Since a created pair or an
annihilated pair is caused by one undetected error in the message,
it will be accounted for as such.


Iterative Discarding System 


In an iterative discarding system, the feedback process is used to
improve the confirmation-denial information as well as the main
message. Its rules are as follows: If the feedback message
indicates that an incorrect direct channel message was received,
the sender transmits an erasure followed by the correct message. If
the feedback message indicates that an erasure was incorrectly
received as a message word, the sender transmits two erasures to
erase both the message word and the preceding incorrect message.
When the feedback message indicates that a message word has been
incorrectly received as an erasure, the sender transmits not
another erasure, but simply the preceding message and the intended
message.


These rules may be summarized in tabular form. In this table 0, 1
represent the message symbols, $\phi$ represents the erasure symbol,
and P represents the preceding unerased symbol.

[Table not recoverable from the scan. Legend: R = Received, T = Transmitted]


The information that a particular message 
is confirmed or denied is not entirely contained 
in the absence or presence of an erasure signal. 
Each succeeding message is a confirmation of a
previous choice. To illustrate this point, some 
examples of discarding systems with error-free 
and totally distinctive feedback channels will 
be given below: 


Example I 


As an example consider the message to be 
transmitted 
"GONE WITH THE WIND" 


Let [] denote a space. The messages are: 


iL % 
Transmitted G 0 


Reported GOPGYNOTPPFELIWPWITH... 
A aad od 


The erasing procedure, indicated by the 
arrows, permits the correct message to be deter- 
mined. 

GONE WITH ...


The important result is that, independent of where the errors
originate, as long as the feedback is error free and totally
distinctive, all errors will eventually be corrected and the final
message will be error free.
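The rules of the iterative discarding system can be illustrated with a small simulation. The Python sketch below is a simplified model under stated assumptions, not the author's procedure in full: feedback is taken as error free and totally distinctive, so the sender always knows what the receiver currently holds; channel errors replace a symbol by any other symbol with equal probability; `#` stands in for the erasure symbol; and the run stops as soon as the receiver holds the full message. All names are hypothetical.

```python
import random

ERASURE = "#"  # stand-in for the erasure symbol

def receiver_update(accepted, symbol):
    """Receiver rule: an erasure deletes the last unerased symbol; anything else is appended."""
    if symbol == ERASURE:
        if accepted:
            accepted.pop()
    else:
        accepted.append(symbol)

def next_transmission(message, accepted):
    """Sender rule under error-free, totally distinctive feedback.

    If what the receiver holds is not a prefix of the message, send an
    erasure; otherwise send the next message symbol.
    """
    if accepted != list(message[:len(accepted)]):
        return ERASURE
    return message[len(accepted)]

def simulate(message, alphabet, p_error, seed=0, max_steps=100_000):
    """Run the direct channel with symbol error probability p_error."""
    rng = random.Random(seed)
    symbols = list(alphabet) + [ERASURE]
    accepted, sent_count = [], 0
    while accepted != list(message) and sent_count < max_steps:
        sent = next_transmission(message, accepted)
        received = sent
        if rng.random() < p_error:
            # Any other symbol, including the erasure, may be received instead.
            received = rng.choice([s for s in symbols if s != sent])
        receiver_update(accepted, received)
        sent_count += 1
    return "".join(accepted), sent_count

message = "GONE WITH THE WIND"
decoded, sent_count = simulate(message, "ABCDEFGHIJKLMNOPQRSTUVWXYZ ", p_error=0.1)
```

The single prefix test reproduces the three rules of the text: a wrong word draws one erasure and a repeat, an erasure taken as a word draws two erasures, and a word taken as an erasure draws the preceding word followed by the intended one.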


One of the difficulties of this and other 
feedback systems is the delay resulting from the 
spatial separation between the sender and the 
receiver. A method of overcoming this difficulty 
without standby operation, is to delay the corre- 
spondence between the corrective messages and the 
original message. 


Example II 
As an example assume that the sender will 
receive the feedback message before he is about 
to send the fourth succeeding letter. For 
convenience, arrange the message as follows: 
    G   O   N   E
    []  W   I   T
    H   []  T   H
    E   []  W   I
    N   D   []

and send the message by scanning each row in turn. Each column will
then do its own correcting.


Transmitted Message Reported Message 


GONE ay 0 e E 
(lw gt @ws® 
Gc {ln g aaa 
eae (1@\r ¢ 
HgT @G H ile ¢g 
EeGd Wat E@wr 
N []- H N (]- H 
- (])- 1 Sent 
Se ap ee 


The circles indicate errors in reception. 
Independently applying the erasure rules, as 
indicated by the arrows, the message is re- 
covered. 


In this example the message was very short. If, for instance, the
first chapter of Gone With the Wind was transmitted by this process,
the number of errors in each column would be almost equal. The time
required to transmit the longest column would then be virtually
equal to the time required to transmit the shortest.
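The column arrangement of Example II amounts to a simple interleaving. The Python sketch below shows only the arrangement and the row-by-row transmission order for the error-free case; the underscore standing for a space and the function names are choices made here for illustration.

```python
def split_into_columns(message, delay):
    """Arrange the message in rows of `delay` symbols and return the columns."""
    return [message[start::delay] for start in range(delay)]

def row_scan(columns):
    """Transmission order: scan each row in turn, skipping exhausted columns."""
    longest = max(len(col) for col in columns)
    return "".join(col[r] for r in range(longest) for col in columns if r < len(col))

message = "GONE_WITH_THE_WIND_"   # "_" marks a space
columns = split_into_columns(message, 4)
```

Each column can then be treated as an independent stream with its own erasure procedure, since the feedback for a symbol arrives before the next symbol of the same column is due.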


For instance, in Example I, upon receiving the ninth message the
receiver is not absolutely sure that an erasure was sent. However,
if some other message was wrongly received as an erasure, the
transmitter would have sent an O as the tenth message instead of an
E. Therefore upon receiving the tenth message the receiver becomes
more certain that the ninth message was correct. Of course, the
tenth message could have been an O which was wrongly received as an
E, but in that case, upon reporting E, the eleventh message would
have been $\phi$ instead of []. In a similar manner, later messages
confirm or deny the correctness of the previous messages. By the
time a long message has been transmitted, all previous
confirmation-denials are error free. It is therefore not necessary
to elaborately code the erasure message to obtain error free
confirmation-denial.


Mathematically, the probability of an erroneous confirmation after
receiving $n$ subsequent message words is

$$P_y(x_\phi) = \prod_{s=1}^{n} P_{y_s}(x_\phi) \qquad (31)$$

where $P_{y_s}(x_\phi)$ is the conditional probability, after
receiving $y_s$, that the erasure was sent. Similarly, the
probability of an erroneous denial after receiving $n$ subsequent
message words following an erasure is:

$$P_y(x_e) = P_{y_\phi}(x)\, P_{y_1}(x_e) \prod_{s=2}^{n} P_{y_s}(x_\phi) \qquad (32)$$

where $P_{y_\phi}(x)$ is the conditional probability, after receiving
the erasure, that a message word was actually sent, and
$P_{y_1}(x_e)$ is the conditional probability, after receiving the
first subsequent message word, that the erased message word was
actually sent.


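Since $P_{y_s}(x_\phi)$ is much smaller than unity, upper and lower
bounds $u$ and $\underline{l}$* may be defined such that

$$u > -\log_e P_{y_s}(x_\phi) > \underline{l} \qquad (33)$$

for all $s \neq \phi$. Equations 31 and 32 may be written as:

$$e^{-nu} < P_y(x_\phi) < e^{-n\underline{l}} \qquad (34)$$

$$P_{y_\phi}(x)\, P_{y_1}(x_e)\, e^{-(n-1)u} < P_y(x_e) < P_{y_\phi}(x)\, P_{y_1}(x_e)\, e^{-(n-1)\underline{l}}$$

* The letter "l" is underlined to distinguish it from the number "1".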


The above shows that with a totally distinctive noiseless feedback
channel, the probabilities of erroneous confirmation and erroneous
denial are reduced to zero at approximately an exponential rate. It
is not necessary, however, for the feedback channel to be totally
distinctive on the message words themselves. The above deduction is
equally valid as long as the member of $\{X\}$ containing $\phi$
consists of $\phi$ only. In case $y_1$ and $x_e$ happen to belong to
the same group, equation 32 may appear to be wrong at first glance.
However, in such a case, as a message word of the same group is
finally received, the erroneous first denial is considered as
having been corrected for, and the final confirmation-denial signal
is not in error in itself.


A more complicated situation exists if the member of $\{X\}$
containing the erasure contains message words as well and the
feedback channel is not error free. In order to utilize equation 8
to determine the conditional probabilities of a finally accepted
message word, an approximate expression will be derived for
$P_d(c)$, which is the error probability that a word is finally
confirmed while an erasure was originally transmitted. Neglecting
intersymbol influence, one has approximately

$$P_d(c) = \sum_{y'} P_{x_\phi}(y_{i'j'k'})\, P_{Y_{i'}}(Z_\phi)\, P_c(c) + \sum_{y'} P_{x_\phi}(y_{i'j'k'})\, [1 - P_{Y_{i'}}(Z_\phi)]\, P_d(c) + \sum_{y'} P_{x_\phi}(y_{i'j'k'})\, [1 - P_{Y_{i'}}(Z_\phi)]\, P_{x_\phi}(y_\phi)\, P_d(c) \qquad (35)$$

In equation 35, $\sum_{y'}$ means summing over all values of
$y_{i'j'k'}$. On the right hand side of equation 35, the first term
represents the probability that the erasure signal was received as a
message word without subsequent knowledge of the sender. The second
term represents the probability that, knowing the mistake, the
sender transmits an erasure but has somehow confirmed the mistake
instead. The third term represents the probability that, having
erased the mistake, the sender has transmitted a second erasure
which somehow fails to erase the word which he intended to erase to
begin with. Equation 35 can be written as:

$$P_d(c) = \frac{\sum_{y'} P_{x_\phi}(y_{i'j'k'})\, P_{Y_{i'}}(Z_\phi)\, P_c(c)}{1 - [1 + P_{x_\phi}(y_\phi)] \sum_{y'} P_{x_\phi}(y_{i'j'k'})\, [1 - P_{Y_{i'}}(Z_\phi)]} \qquad (36)$$


The above results can be summarized in the form of a theorem:

Theorem II  In an iterative discarding system, the error
probabilities of the confirmation-denial signal are successively
reduced by the iterative process. If the feedback channel is error
free, and if the feedback group containing the erasure signal
contains it as the one and only signal, the error probabilities
decay to zero at approximately an exponential rate with each
subsequent confirmed signal. Otherwise they converge rapidly to
constants. The probability of failure to finally erase an erroneous
word is approximately

$$P_d(c) = \frac{\sum_{y'} P_{x_\phi}(y_{i'j'k'})\, P_{Y_{i'}}(Z_\phi)\, P_c(c)}{1 - [1 + P_{x_\phi}(y_\phi)] \sum_{y'} P_{x_\phi}(y_{i'j'k'})\, [1 - P_{Y_{i'}}(Z_\phi)]}$$


Equivocation of a Confirmed Word 


In an iterative discarding system, a finally accepted word has the
following background:

1. It has been confirmed.

2. In its previous history, it could have been wrongly transmitted
and erased. However, such an occurrence does not change its
equivocation in any way. The rejected word is totally discarded and
the receiver is not capable of realizing whatever information is
left in it.

In the derivation of Theorem I, equation 11 is perfectly general
and will express the equivocation of a finally accepted message
word if $P_{ii'}$ is interpreted as the iterated conditional
probability that the feedback word $Y_{i'}$ will eventually be
confirmed, when $x_{ij} \in X_i$ was originally transmitted.


Theorem III  The equivocation of an accepted message word is

$$H_{c\,y_{i'j'k'}}(X) = \log P_{y_{i'j'k'}}(c) - \sum_i \sum_j P_{c\,y_{i'j'k'}}(x_{ij}) \left\{ \log P_{ii'} + \log P_{y_{i'j'k'}}(x_{ij}) \right\}$$

where $P_{ii'}$ is the iterated conditional probability that the
feedback word $Y_{i'}$ will eventually be confirmed, when
$x_{ij} \in \{X_i\}$ was originally transmitted. $P_{y_{i'j'k'}}(c)$
and $P_{c\,y_{i'j'k'}}(x_{ij})$ are respectively

$$P_{y_{i'j'k'}}(c) = \sum_i \sum_j P_{ii'}\, P_{y_{i'j'k'}}(x_{ij})$$

$$P_{c\,y_{i'j'k'}}(x_{ij}) = \frac{P_{ii'}\, P_{y_{i'j'k'}}(x_{ij})}{P_{y_{i'j'k'}}(c)}$$


Corollary I  If the feedback channel is error free, and if the
feedback group containing the erasure signal contains it as the one
and only signal, the equivocation of an accepted message word is

$$H_{c\,y_{i'j'k'}}(X) = \log \sum_j P_{y_{i'j'k'}}(x_{i'j}) - \frac{\sum_j P_{y_{i'j'k'}}(x_{i'j}) \log P_{y_{i'j'k'}}(x_{i'j})}{\sum_j P_{y_{i'j'k'}}(x_{i'j})}$$

provided that the message word is followed by a sufficient number
of accepted message words.

In other words, the equivocation left in such a system is equal to
the uncertainty left among the signals within the same group. The
Corollary follows directly from the Theorem since in such a system
the iterated $P_{ii'}$ is equal to $\delta_{ii'}$.


Corollary II When the error free feedback 
channel has a capacity at least as great 
as the direct channel, information may be 
sent error free at a rate equal to or less 
than the information rate of the direct 
channel. The equality holds if there is
no left over information in the rejected
message words.


Corollary II follows from Corollary I 
of Theorem I and Corollary I of Theorem III. 


III Examples of Iterative Discarding Systems 


1. Error Free Repetitive Feedback 


Let us assume a channel with N symmetrical 
message positions per digit. The probability 
that a digit is received incorrectly is p. One 
of the N positions will be used as an "erasure". 


From the preceding section the final message will be error free,
but the fraction of informative digits that get through is

$$m = (1 - 2p) \qquad (37)$$

Equation 37 is due to the fact that no matter how an error occurs,
two digits are nullified with each error occurrence. The
information rate per digit is

$$R'_f = m \log (N-1) = (1 - 2p) \log (N-1) \qquad (38)$$

The information rate of the direct channel alone is

$$R' = \log N - E_y \qquad (39)$$

where $E_y$ is the equivocation per digit and depends upon the error
distribution. If equal distribution among the $N-1$ states is
assumed,

$$E_y = -p \log \frac{p}{N-1} - (1-p) \log (1-p) \qquad (40)$$

The net loss in information rate due to feedback is

$$\Delta R' = R' - R'_f = \log N + p \log \frac{p}{N-1} + (1-p) \log (1-p) - (1 - 2p) \log (N-1) \qquad (41)$$
Equation 41 can be written as

$$\Delta R' = p \log p + (1-p) \log (1-p) + p \log N + (1-p) \log \frac{N}{N-1} \qquad (42)$$

Let $p_1$ be defined as $1/N$; then

$$\Delta R' = p \log \frac{p}{p_1} + (1-p) \log \frac{1-p}{1-p_1} \qquad (43)$$

To evaluate $\Delta R'$, let us determine its stationary point:

$$\frac{d\,\Delta R'}{dp_1} = -\frac{p}{p_1} + \frac{1-p}{1-p_1} = \frac{p_1 - p}{p_1 (1 - p_1)} \qquad (44)$$

Hence $\Delta R'$ has one and only one stationary point, at
$p_1 = p$. The stationary point is a minimum, since
$d\,\Delta R'/dp_1$ is negative for $p_1 < p$ but positive for
$p_1 > p$. Therefore $\Delta R' \geq 0$. The equality sign holds only
if $p_1 = p$, or $Np = 1$.


If $Np = 1$ two conditions are met:

(a) The rate provided for confirming information,

$$\log N - (1-p) \log (N-1),$$

is equal to the information that must be confirmed,

$$-p \log p - (1-p) \log (1-p).$$

(b) There is no residue information in a rejected word. Once a word
is rejected, its probability of being correct is the same as that
of the erasure being wrong. It is $p = 1/N$.
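Equation 43 has the form of a divergence between the two-point distributions $(p, 1-p)$ and $(p_1, 1-p_1)$, which is why the loss cannot be negative. The Python sketch below evaluates equations 41 and 43 side by side; the choice $N = 8$ and the sample error probabilities are assumptions made here for illustration, and the function names are hypothetical.

```python
import math

def rate_loss(p, n_positions):
    """Equation 41: direct-channel rate minus feedback-system rate, per digit."""
    n = n_positions
    equivocation = -p * math.log(p / (n - 1)) - (1 - p) * math.log(1 - p)  # eq. 40
    direct_rate = math.log(n) - equivocation                                # eq. 39
    feedback_rate = (1 - 2 * p) * math.log(n - 1)                           # eq. 38
    return direct_rate - feedback_rate

def rate_loss_divergence_form(p, n_positions):
    """Equation 43 with p1 = 1/N."""
    p1 = 1.0 / n_positions
    return p * math.log(p / p1) + (1 - p) * math.log((1 - p) / (1 - p1))

n = 8
losses = {p: rate_loss(p, n) for p in (0.01, 0.05, 1 / n, 0.2, 0.4)}
```

For every sampled value the two forms agree, the loss is non-negative, and it vanishes at $p = 1/N$, in line with the stationary-point argument above.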


2. Error Free Parity Check Feedback 


Both the direct transmission channel and 
feedback channel are composed of binary digits. 
The feedback channel is error free but of very
low capacity. For every m digits sent through 
the direct channel, 1 digit is reported to check 
its parity. 


p = probability of error per digit, 
direct channel 


p' = probability of odd errors, which will 
be detected and erased. 


p" = probability of even errors which will 
be passed unnoticed. 


One of the $2^m$ messages per group will be used as an erasure.

$P_a$ = probability of acceptance = $1 - 2p'$

$p_e$ = probability of error in accepted message groups

$$p_e = \frac{p''}{1 - 2p'} \qquad (45)$$

Equation 45 is only approximately true. It is assumed that the
error probability $p''$ is small, such that the probabilities of two
errors making it correct or making only one error in the end, etc.
are neglected. For more accurate calculations, equation 36 should
be used.

$C_f$ = channel capacity per digit with feedback

$$C_f = (1 - 2p')\, \frac{1}{m} \log (2^m - 1) \qquad (46)$$

$$p' = \sum_{i\ \text{odd}}^{i \leq m} \frac{m!}{i!\,(m-i)!}\, p^i (1-p)^{m-i} = \frac{[(1-p) + p]^m - [(1-p) - p]^m}{2}$$

$$p' = \tfrac{1}{2} \left[ 1 - (1 - 2p)^m \right] \qquad (47)$$

$$p'' = \sum_{i\ \text{even},\ i \geq 2}^{i \leq m} \frac{m!}{i!\,(m-i)!}\, p^i (1-p)^{m-i} = \frac{[(1-p) + p]^m + [(1-p) - p]^m}{2} - (1-p)^m$$

$$p'' = \tfrac{1}{2} \left[ 1 - 2(1-p)^m + (1 - 2p)^m \right] \qquad (48)$$

From equations 45, 46, 47, and 48, $p_e$ and $C_f$ can be calculated.
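The closed forms for the odd-error and even-error probabilities can be compared with the binomial sums they replace. The Python sketch below does this and then evaluates the acceptance probability, the error rate among accepted groups, and the capacity. Two points are assumptions made here: equation 45 is read as $p''/(1-2p')$, and the logarithm in equation 46 is taken to base 2. The function names and the sample values $p = 0.01$, $m = 8$ are illustrative only.

```python
import math

def odd_even_error_probs(p, m):
    """Closed forms of equations 47 and 48 for m digits with error probability p."""
    p_odd = 0.5 * (1 - (1 - 2 * p) ** m)
    p_even = 0.5 * (1 - 2 * (1 - p) ** m + (1 - 2 * p) ** m)
    return p_odd, p_even

def odd_even_error_probs_by_sum(p, m):
    """Direct binomial sums over odd i and over even i >= 2."""
    def term(i):
        return math.comb(m, i) * p**i * (1 - p) ** (m - i)
    return (sum(term(i) for i in range(1, m + 1, 2)),
            sum(term(i) for i in range(2, m + 1, 2)))

def parity_feedback_figures(p, m):
    """Return (P_a, p_e, C_f) under the reading of eq. 45 assumed above."""
    p_odd, p_even = odd_even_error_probs(p, m)
    acceptance = 1 - 2 * p_odd                        # P_a
    error_in_accepted = p_even / acceptance           # eq. 45 (assumed reading)
    capacity = acceptance * math.log2(2**m - 1) / m   # eq. 46, bits per digit
    return acceptance, error_in_accepted, capacity

closed = odd_even_error_probs(0.01, 8)
summed = odd_even_error_probs_by_sum(0.01, 8)
acceptance, error_in_accepted, capacity = parity_feedback_figures(0.01, 8)
```

Replacing `m` by `m + 1` in `odd_even_error_probs` gives the corresponding figures when the feedback digit is itself subject to the same error probability.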


3. Noisy, Parity Check Feedback 


In this case, probabilities of odd and even 
errors will include the feedback digit. For 
simplicity, let us assume that the feedback digit 
has the same error probability p. Equations 
(45) and (46) will remain unchanged, and 
equations (47) and (48) become 


$$p' = \tfrac{1}{2} \left[ 1 - (1 - 2p)^{m+1} \right] \qquad (49)$$

$$p'' = \tfrac{1}{2} \left[ 1 - 2(1-p)^{m+1} + (1 - 2p)^{m+1} \right] \qquad (50)$$


4. Error Free Hamming Check Digit Feedback


For every $m$ digits sent through the direct channel, $c$ digits
will be fed back. Out of these, 1 digit is for parity check on the
$m$ digits only and the remaining $c-1$ digits are for single error
correcting double error detecting, or triple error detecting. Since
there is no need to check on the $c-1$ digits themselves,

$$m \leq 2^{c-1} - 1 \qquad (51)$$

The inequality 51 determines the number of required feedback
digits.


Using this code, up to 3 errors will be detected and erased. If
there are four errors, the only cases which are not detected are
the ones which have the same check digits; e.g., if $m = 7$, the
sequences 0000000 and 0111100 have the same check digits, 0000.
Since with three arbitrarily placed error digits the position of
the fourth digit is fixed by the check digits (for equality of
check digits), a condition which always exists if

$$m = 2^{c-1} - 1$$

but sometimes does not exist if

$$m < 2^{c-1} - 1,$$

the probability of four undetected errors is

$$p_4 < \frac{m(m-1)(m-2)}{24}\, p^4 (1-p)^{m-4} \qquad (52)$$

Five errors will always be detected by the parity check. The
probability of six errors is much smaller than $p_4$. Hence the
probability of an unnoticed error is

$$p'' \approx p_4 < \frac{m(m-1)(m-2)}{24}\, p^4 (1-p)^{m-4} \qquad (53)$$


The probability of being correct is $(1-p)^m$. Therefore the
probability of a detected error is

$$p' = 1 - (1-p)^m - p'' > 1 - (1-p)^m - \frac{m(m-1)(m-2)}{24}\, p^4 (1-p)^{m-4} \qquad (54)$$

$p_e$ and $C_f$ can be calculated from equations 45 and 46, using the
values of $p'$ and $p''$ from equations 53 and 54.

If direct repetition were used, the error probability would be of
the order $p^2$, instead of $p^4$ as is obtained with feedback.
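The claim that no pattern of one, two, or three errors escapes detection, while some four-error patterns do, can be verified by enumeration for $m = 7$. The Python sketch below assumes the usual Hamming arrangement in which direct digit number $k$ (counted from 1) is covered by the checks named in the binary expansion of $k$, together with one overall parity digit on the $m$ digits; that arrangement and the function names are assumptions made here.

```python
from itertools import combinations

def check_digits(error_positions):
    """Check digits produced by an error pattern: (XOR of error positions, parity).

    Assumes direct digit k (1-based) is covered by the Hamming checks
    given by the binary expansion of k.
    """
    syndrome = 0
    for position in error_positions:
        syndrome ^= position
    return syndrome, len(error_positions) % 2

def undetected_patterns(m):
    """All non-empty error patterns on m digits whose check digits are all zero."""
    return [pattern
            for weight in range(1, m + 1)
            for pattern in combinations(range(1, m + 1), weight)
            if check_digits(pattern) == (0, 0)]

patterns = undetected_patterns(7)
weights = sorted({len(pattern) for pattern in patterns})
```

For $m = 7$ the enumeration finds only four-error patterns, among them the pattern at positions 2, 3, 4, 5 that turns 0000000 into 0111100, and their number does not exceed $m(m-1)(m-2)/24$.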

5. Noisy, Hamming Check Digit Feedback 


In this case, the $c$ feedback digits are checked together with the
$m$ direct information digits. Hence

$$m + c - 1 \leq 2^{c-1} - 1$$

or

$$m \leq 2^{c-1} - c \qquad (55)$$


If the same error probability p is assumed 
for feedback digits as for the information digits, 
p" and p' become respectively: 


ne (m+c)! in mc=), 

p Er mns i: pt (1-p) (56) 

: m+c (m+)! I; mo-k 

p! #1 -(1-p) i area) 
(57) 


Equations 45, 46, 56, and 57 give values of $p_e$ and $C_f$.


One case of special interest is that of $m = 4$. Inequality 55 gives
$c = 4$. The possible combinations are as follows:


Feedback Digits 
4 2 1 Parity


Direct Message Digits 
Tarn Oe Shen ae 


HPHEHHMHHYHOOOCOOOOO 
FPHEHOOOOKFPHFHrHOCOOO 
FPRHOORPRPOORFPKHPOOFRHOCO 
KFOHPOPOPOHOHOPFOHO 
FPrPOODOOOFPRPOOHRPHFHOO 
POP OCOPOHOHOFPHOYrS 
HOOHPHOOKFOHHPOOPRPHO 
FPOOCOrFOrFPRPORFOCOFOFFHOSO 


From the above table it is seen that

1. There is a one to one correspondence between the direct message
and the feedback message. In terms of the general description of
information feedback, each feedback group contains only one
message.

2. The distance $d(x,y)$ between two direct messages $x$ and $y$ and
the distance $d(x',y')$ between the two corresponding feedback
messages $x'$ and $y'$ are related as follows:

    d(x,y)    d(x',y')
      1          3
      2          2
      3          1
      4          4


The general philosophy of noisy feedback appears to be:

"Select the corresponding feedback messages such that if two
messages are close together in signal space, their corresponding
feedback messages are far away in the signal space, and vice
versa."
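The distance relation for $m = 4$, $c = 4$ can be reproduced by enumeration. The Python sketch below assumes the standard Hamming placement, with the four direct digits at positions 3, 5, 6, 7, the check digits labeled 4, 2, 1 as in the table heading, and an overall parity digit over all seven positions. The placement and the function names are assumptions made here and may differ in detail from the table in the text.

```python
from itertools import combinations, product

DATA_POSITIONS = (3, 5, 6, 7)   # assumed placement of the 4 direct digits
CHECK_LABELS = (4, 2, 1)        # Hamming check digits, labeled as in the table

def feedback_digits(direct):
    """Return the 4 feedback digits (checks 4, 2, 1 and overall parity)."""
    word = dict(zip(DATA_POSITIONS, direct))
    checks = []
    for label in CHECK_LABELS:
        covered = [bit for position, bit in word.items() if position & label]
        checks.append(sum(covered) % 2)
    parity = (sum(direct) + sum(checks)) % 2
    return tuple(checks) + (parity,)

def distance(a, b):
    """Number of digit positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))

# Map each direct-message distance to the set of feedback distances observed.
relation = {}
for x, y in combinations(product((0, 1), repeat=4), 2):
    relation.setdefault(distance(x, y), set()).add(
        distance(feedback_digits(x), feedback_digits(y)))
```

Under these assumptions each direct distance corresponds to a single feedback distance, and messages that differ in one digit have feedback words that differ in three, which is the behavior the quoted rule describes.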


Conclusion 


In the above, the process of information 
feedback is formulated in a relatively general 
form, and basic relations governing information 
rate and reliability are derived. 


Typical examples are given which show that the feedback process is
far more effective in improving reliability than direct coding.
When a feedback channel of the same or less error probability is
available, equivocation can be effectively reduced without
laborious coding. The loss in information rate is small.
In the limiting case of error free feedback with 
sufficient capacity for simple repetition, error 
free transmission can be obtained at little or 
no cost of information rate. When the feedback 
channel is not reliable, it can be coded such 
that substantially less but comparably reliable 
information will be reported to the sender and 
much of the advantages of feedback can still be 
realized. 


However, more questions are raised than answered. Just to name a
few: Could predictive feedback be used to reduce redundancy and to
increase communication rate? What are the criteria of optimum
coding of the feedback channel in case it is noisy? At what signal
to noise ratio could the direct transmission channel be uncoded
without significant loss of information rate? Is there any
advantage in a compound feedback process in which both decision
feedback and information feedback are used? It is hoped that this
paper will focus attention on the problems related to information
feedback so that they will be answered by future investigations.


Acknowledgment 


This work has been sponsored by the 
Air Force Cambridge Research Center, Air 
Research and Development Command, Cambridge,
Mass., under contract number AF19(604)1049. 


The writer is indebted to his colleagues Mr. Frank J. Bloom and
Mr. Bernard Harris. The study of feedback systems was undertaken at
the suggestion of Mr. Bloom, who also directs this research
project. Mr. Harris has done an excellent job in editing portions
of the quoted reports (reference 2), from which substantial
sections of this paper are taken directly.


References 


1. The phrase "in part" is in the Shannon sense
of choice or uncertainty, that the feedback
message reduces but does not eliminate the
uncertainty of the message which has been received.
See C. E. Shannon and W. Weaver, "The Mathematical
Theory of Communication," pp. 18-22, University
of Illinois Press, 1949.


2. "Supplementary Notes on Evaluation Theory
For Communication Systems," Seventh Quarterly
Report, September 15, 1955 to December 15, 1955;
and First Scientific Report, January 15, 1956
to March 15, 1956. Submitted to Air Force
Cambridge Research Center by Research Division,
College of Engineering, New York University.


3. R. W. Hamming, "Error Detecting and Error
Correcting Codes," Bell System Technical
Journal, 29, pp. 147-160 (1950).


4. For an example of iteration in a direct coding
scheme, see Peter Elias, "Error-Free Coding,"
Trans. IRE, Information Theory 4 (1954),
pp. 29-37.


A LINEAR CODING FOR TRANSMITTING A SET 
OF CORRELATED SIGNALS 


H. P. Kramer and M. V. Mathews 
Bell Telephone Laboratories, Incorporated
Murray Hill, N. J.


ABSTRACT A coding scheme is described for the 
transmission of n continuous correlated signals 
over m channels, m being equal to or less than n. 
Each of the m signals is a linear combination of 
the n original signals. The coefficients of this 
linear transformation, which constitute an m x n
matrix, are constants of the coding scheme. For 
the purpose of decoding, the m signals are once 
more combined linearly into n output signals which 
approximate the input signals. The coefficients 
of the coding matrix which minimize the sum of the 
mean square differences between the original sig- 
nals and the reconstructed ones are shown to be 
the components of the eigenvectors of the matrix 
of the correlation coefficients of the original 
signals. The decoding matrix is the transpose of 
the coding matrix. 


As an example, the coding scheme is applied 
to a channel vocoder in which speech is trans- 
mitted by means of a set of signals proportional 
to the speech energy in the various frequency 
bands. These signals are strongly correlated, and 
the coding results in a substantial reduction in 
the number of signals necessary to transmit highly 
articulate speech. 


The coding theory can be extended to include 
the minimization of the expectation of any posi- 
tive definite quadratic function of the differ- 
ences between the original and reconstructed sig- 
nals. In addition, if the signals are Gaussian, 
the sum of the channel capacities necessary to 
transmit the transformed signals is shown to be 
equal to or less than that necessary to transmit 
the original signals. 


INTRODUCTION 


The coding scheme described in this paper 
can be used for the transmission of any set of 
correlated signals. As an example throughout the 
paper, the transmission of the energy signals in 
the Channel Vocoder will be considered. However, 
the scheme is in no way limited to vocoders. 


The vocoder is a device developed by H. W.
Dudley¹* for reducing the bandwidth necessary to
transmit speech signals. In the vocoder the
speech is passed through contiguous band pass
filters and the outputs of these filters are
rectified and passed through low pass filters.
The resulting transmission signals are measures
of the energy of the original speech in the
various frequency bands. These energy signals
can then be transmitted over separate channels
and the speech reconstructed at the receiving
terminal by a modulation process.


* Superscript numbers refer to references at the
end of the paper.


Figure 1 is a sample of a typical set of 
vocoder energy signals for a 10 channel vocoder. 
As can be seen from the figure, the various 
energy signals are highly correlated and conse- 
quently some coding should exist which removes 
this correlation and results in an even further 
reduction of channel capacity or alternatively 
more efficient exploitation of the vocoder scheme. 
One such coding was suggested by the following 
line of reasoning: Since the signals are so 
well correlated a desirable procedure might be to 
send the first signal and the difference between 
the first signal and the second signal, the first 
signal and the third signal, etc., these differences
being small. Perhaps if the correlation
between two channels were sufficiently high, the
difference signal for these two channels would
be small enough to be neglected, and thus
the total number of channels necessary to send
would be reduced by one.


The procedure of sending difference signals 
suggests transmitting linear combinations of the 
signals: 


$$x_i = \sum_{j=1}^{n} A_{ij} e_j, \qquad i = 1, \dots, m \tag{1}$$


In equation 1 the $e_j$ signals are the original
time varying vocoder signals, the $x_i$ signals are
the time varying transmission signals and the $A_{ij}$
coefficients are constants of the coding. At
the receiving terminal the $e_i$ signals are
constructed by a second linear transformation


$$e'_i = \sum_{j=1}^{m} B_{ij} x_j, \qquad i = 1, \dots, n \tag{2}$$


in which $e'_i$ is the reconstructed $e_i$ signal. A
diagram of an entire transmission system is shown
in Fig. 2. If the number m of $x_i$ signals equals
the number n of $e_i$ signals, then the $e_i$ signals
can be reconstructed exactly at the receiving
terminal by making the matrix of $B_{ij}$ coefficients
inverse to the matrix of $A_{ij}$ coefficients. If
some signals have been neglected so that m < n
then the $e_i$ signals can be reconstructed only
approximately at the receiving terminal and the
question arises what choice of $A_{ij}$, $B_{ij}$ coefficients
will minimize some measure of the error so
introduced. The error measure considered here
is the sum of the mean square errors given by the
relation


$$M = \sum_{i=1}^{n} \overline{(e_i - e'_i)^2} \tag{3}$$


As might be expected this selection of error 
criterion yields a simple solution to the mini- 
mization problem, but in addition it has been 
justified experimentally to some extent in the 
case of the vocoder. 


The exact formula for the choice of coef- 
ficients will be discussed in the next section. 
However, the heuristic basis for this choice can 
be pointed out here. At each instant in time 
the values of the n channel signals can be con- 
sidered as the coordinates of a point in n dimen- 
sional space. For example, a two channel vocoder 
can be represented by the two dimensional space 
of Figure 3. As time progresses the point will 
trace out a pattern as shown in Figure 3. Because
the signals are well correlated this pattern
will tend to concentrate in some particular
area. The linear transformation specified by
equation 1 amounts to a rotation of coordinate
axes on Figure 3 to a new set of axes $x_1$, $x_2$.
If the rotation is chosen so the $x_1$ axis
corresponds to the major dimension of the pattern
then the $x_1$ signal will contain most of the
information about $e_1$ and $e_2$. Consequently,
neglecting the $x_2$ signal will result in a minimum
error. This procedure will be stated exactly
for the n dimensional case in the next
section.
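

The two-channel picture can be checked numerically. The sketch below is only an illustration: the sample values, the helper name `principal_angle`, and the use of Python are assumptions and are not part of the original paper. It rotates the axes onto the major dimension of a correlated two-signal pattern and compares the mean square value carried by $x_1$ with the value that would be lost by neglecting $x_2$.

```python
import math

# Hypothetical samples of two well-correlated channel signals e1, e2.
e1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
e2 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

def mean(xs):
    return sum(xs) / len(xs)

# Time-averaged products, i.e. the entries of the 2 x 2 correlation matrix.
r11 = mean([a * a for a in e1])
r22 = mean([b * b for b in e2])
r12 = mean([a * b for a, b in zip(e1, e2)])

def principal_angle(r11, r12, r22):
    """Angle of the major axis of the pattern for a symmetric 2 x 2 matrix."""
    return 0.5 * math.atan2(2.0 * r12, r11 - r22)

theta = principal_angle(r11, r12, r22)
c, s = math.cos(theta), math.sin(theta)

# Rotation of the coordinate axes: x1 lies along the major dimension.
x1 = [c * a + s * b for a, b in zip(e1, e2)]
x2 = [-s * a + c * b for a, b in zip(e1, e2)]

power_x1 = mean([v * v for v in x1])
power_x2 = mean([v * v for v in x2])  # error incurred if x2 is neglected

print(f"rotation angle = {math.degrees(theta):.1f} degrees")
print(f"mean square x1 = {power_x1:.4f}, mean square x2 = {power_x2:.4f}")
```

For these assumed samples nearly all of the mean square value ends up in $x_1$, which is the situation the heuristic argument relies on.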


OPTIMUM 
TRANSFORMATION COEFFICIENTS 


The mean square error given by equation 3
in the previous section is a function of the $A_{ij}$,
$B_{ij}$ coefficients and of the correlation
coefficients $R_{ij}$ of the channel signals where


$$R_{ij} = \overline{e_i e_j} \tag{4}$$


In the Appendix it is shown that to minimize the
mean square error when transmitting m (< n) signals
$A_{ij}$ should be the jth component of the ith
normalized eigenvector of the matrix $(R_{ij})$. These
conditions may be specified by the equations


$$L_i A_{ij} = \sum_{k=1}^{n} R_{jk} A_{ik} \tag{5}$$
and 
$$\sum_{j=1}^{n} A_{ij} A_{kj} = \delta_{ik} \tag{6}$$


where $L_i$ is the eigenvalue associated with the ith
eigenvector and the eigenvalues are arranged in
order of decreasing magnitude,


$$L_1 \ge L_2 \ge \dots \ge L_n \tag{7}$$


The $B_{ij}$ coefficients are related to the $A_{ij}$
coefficients by


$$B_{ij} = A_{ji} \tag{8}$$


The error is the sum of the eigenvalues for
the n - m channels which have been neglected,


$$M = \sum_{i=m+1}^{n} L_i \tag{9}$$


The optimization procedure can easily be extended
to include the minimization of the average
of any positive definite quadratic form of the
individual channel errors.
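

As a rough numerical illustration of equations 5 to 9, the sketch below builds the coding matrix from the leading eigenvectors of a small correlation matrix and checks that the resulting mean square error equals the sum of the neglected eigenvalues. The 3 x 3 matrix, the function names, and the choice of power iteration with deflation are assumptions made for this example; the paper states only that an iterative routine was used.

```python
import math

def mat_vec(R, v):
    return [sum(r * x for r, x in zip(row, v)) for row in R]

def top_eigenpairs(R, m, iters=500):
    """m largest eigenpairs of a symmetric non-negative matrix R
    (power iteration with deflation; a stand-in iterative routine)."""
    n = len(R)
    work = [row[:] for row in R]
    pairs = []
    for k in range(m):
        v = [1.0 + 0.1 * (i + k) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, mat_vec(work, v)))
        pairs.append((lam, v))
        for i in range(n):              # deflate: remove the found component
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def mean_square_error(R, A):
    """M = tr[(I - C) R (I - C)^T] with C = A^T A, i.e. decoding by B = A^T."""
    n = len(R)
    C = [[sum(a[i] * a[j] for a in A) for j in range(n)] for i in range(n)]
    D = [[(1.0 if i == j else 0.0) - C[i][j] for j in range(n)] for i in range(n)]
    DR = [[sum(D[i][k] * R[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return sum(DR[i][k] * D[i][k] for i in range(n) for k in range(n))

# Hypothetical correlation matrix with eigenvalues 3, 1 and 0.5.
R = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 0.5]]

pairs = top_eigenpairs(R, m=1)          # transmit a single channel
A = [v for _, v in pairs]               # rows of A are eigenvectors
error = mean_square_error(R, A)
neglected = sum(R[i][i] for i in range(3)) - sum(lam for lam, _ in pairs)

print(f"mean square error = {error:.6f}")
print(f"sum of neglected eigenvalues = {neglected:.6f}")
```

With one transmitted channel the two printed numbers agree, as equation 9 predicts for this assumed matrix.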


CHANNEL CAPACITY REDUCTION 


In addition to the error minimization properties
described in the previous section, the
eigenvector transformation can be shown to yield
a saving in channel capacity if the $e_i$ signals
are Gaussian. The computation of channel capacity
is complicated by the fact that $e_i$ and $x_i$ are
continuous signals, and their source rates are
infinite unless a fidelity criterion is used.
However, if the mean square fidelity criterion M (see
Eq. 3) is applied to the total transmission system,
the minimum sum of the $e_i$ source rates which
satisfies the M criterion can be shown to be equal to
or greater than the minimum sum of the $x_i$ source
rates which satisfies the M criterion. This
result may be interpreted as meaning that if the $e_i$
signals are transmitted over n independent
channels, the sum of the channel capacities necessary
to meet the M criterion is equal to or greater
than the corresponding sum for the $x_i$ signals.


The existence of a channel capacity saving
can be established even when no channels are
neglected (m = n). The saving for m = n is most
simply shown when the $x_i$ signals have flat
band-limited spectra, in which case the source rate of
an $x_i$ signal may be written²


$$R_i = W \log \frac{\overline{x_i^2}}{\overline{n_i^2}}, \qquad \overline{x_i^2} > \overline{n_i^2} \tag{10}$$


where W is the bandwidth and $\overline{n_i^2}$ is the allowable
mean-square transmission error. The sum of the
$x_i$ source rates is


$$R_x = \sum_{i=1}^{n} W \log \frac{\overline{x_i^2}}{\overline{n_i^2}} \tag{11}$$


By summing the effects on $e_i$ of the $n_i$ errors
the resulting M fidelity criterion can be
shown to be


$$M = \sum_{i=1}^{n} \overline{n_i^2} \tag{12}$$


If the $\overline{n_i^2}$ errors are distributed among the various
channels so as to minimize $R_x$ while satisfying
Eq. (12) the resulting distribution has equal
errors in each channel. The minimum value of $R_x$
so obtained is


$$R_x = \sum_{i=1}^{n} W \log \frac{n \, \overline{x_i^2}}{M} \tag{13}$$


Similarly, the minimum sum of the $e_i$ source rates
which satisfy Eq. (12) is


$$R_e = \sum_{i=1}^{n} W \log \frac{n \, \overline{e_i^2}}{M} \tag{14}$$


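Because the $x_i$ signals are uncorrelated and $(A_{ij})$
is an orthonormal matrix, $\overline{e_i^2}$ can be written in
terms of $\overline{x_j^2}$ as


$$\overline{e_i^2} = \sum_{j=1}^{n} A_{ji}^2 \, \overline{x_j^2} \tag{15}$$


If Eq. (15) is substituted into Eq. (14), the
difference in channel capacities, $R_x - R_e$, may be
written


$$R_x - R_e = \sum_{i=1}^{n} W \log \frac{\overline{x_i^2}}{\sum_{j=1}^{n} A_{ji}^2 \, \overline{x_j^2}} \tag{16}$$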


That this difference is less than or equal to zero for
any orthonormal $(A_{ij})$ matrix is a direct result
of the convexity of the logarithm function.
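

The size of the saving for m = n can be illustrated with a two-signal case. The sketch below is a hedged example, not the paper's computation: the correlation value `rho`, the allowable error `M`, and the use of base-2 logarithms (so that rates come out in bits per second) are assumptions. It evaluates the rate sums for the original signals and for the transformed signals, with the allowable error shared equally among channels.

```python
import math

W = 15.0    # bandwidth in cps (the value quoted for vocoder signals)
rho = 0.9   # hypothetical correlation between two unit-power signals
M = 0.02    # hypothetical total allowable mean square error

e_power = [1.0, 1.0]              # mean squares of e_1, e_2
x_power = [1.0 + rho, 1.0 - rho]  # eigenvalues of [[1, rho], [rho, 1]]

def rate_sum(powers, M, W):
    """Sum over channels of W * log2(n * power / M), equal error M/n each."""
    n = len(powers)
    return sum(W * math.log2(n * p / M) for p in powers)

R_e = rate_sum(e_power, M, W)  # original signals sent directly
R_x = rate_sum(x_power, M, W)  # eigenvector-transformed signals

print(f"R_e = {R_e:.1f} bits/sec, R_x = {R_x:.1f} bits/sec")
```

Both transformed signals here have mean square values above M/n, so each term of the sum is positive and the two totals can be compared directly.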


A similar proof may be carried out for the
more general case where the $e_i$ signals are Gaussian
but with any arbitrary spectra.


Under some circumstances the allowable error
for a signal, $\overline{n_i^2} = M/n$, may be greater than
the mean square signal $\overline{x_i^2}$. The required channel has
essentially zero capacity and may be neglected,
thus reducing the number of channels to be
transmitted.


EXPERIMENTAL TEST OF CODING SCHEME 


The coding procedure which has been described
was evaluated by applying it to a 16-channel
vocoder. The correlation coefficients
$R_{ij}$ were obtained by time averaging $e_i e_j$ over a
four minute sample of speech to the vocoder from
eight different male speakers. The eigenvectors
associated with the $R_{ij}$ matrix and the associated
eigenvalues were computed on an IBM 650 computer
using an iterative routine⁴. The eigenvalues so
obtained were as follows: 9.57, 1.36, 0.72, 0.55,
0.355, Osak, 0.31, 0.25, 0.20, 0.10, 0.07, 0.05,
0.01, 0, 0, 0. The theory shows that the error
committed by omitting a channel is equal to the
eigenvalue associated with that channel. As a
consequence, in order to obtain a substantial
reduction in the number of channels, the eigenvalues
should have a large spread. This condition exists
for the vocoder data.
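

The time averaging used to obtain the correlation coefficients can be sketched as follows. The function name and the tiny two-channel sample are assumptions for illustration; the experiment itself used four minutes of speech on sixteen channels.

```python
def correlation_matrix(samples):
    """samples[t][i] is channel i at time t.
    Returns R with R[i][j] equal to the time average of e_i * e_j."""
    T = len(samples)
    n = len(samples[0])
    return [[sum(s[i] * s[j] for s in samples) / T for j in range(n)]
            for i in range(n)]

# Two hypothetical time samples of a two-channel signal.
R_est = correlation_matrix([[1.0, 2.0], [3.0, 4.0]])
print(R_est)  # [[5.0, 7.0], [7.0, 10.0]]
```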


Systems with one, three, six and ten trans- 
mission channels were realized and their perform- 
ance is roughly as expected. The one-channel 
system was articulate only on very familiar text 
such as "Mary had a little lamb ...". The three-channel
system was estimated to have a sentence


articulation rate better than 50% for unfamiliar 
text. The six-channel system was almost com- 
pletely understandable though its quality was 
substantially less than that of the 16-channel 
vocoder. The ten-channel system was of good 
quality. 


The quality of the 10-channel system was 
judged to be better than that of an existing 10- 
channel vocoder with which direct comparisons 
could be made. This result leads to the conclu- 
sion that to achieve better quality with a given 
number of channels, a system should be used con- 
sisting of a many channel vocoder plus a matrix 
transformation to attain the required number of 
transmission channels. 


Equations (14) and (15) were applied to
evaluate the upper limits of the source rates of
the $e_i$ and $x_i$ signals by assuming both sets of
signals are Gaussian with flat spectra, band
limited at 15 cps., and that the allowable mean
square deviation $\sum \overline{e_i^2}/M$ is 100:1. These figures
have been established as being approximately correct
for existing vocoders. A capacity of 1500
bits/sec. will transmit the $e_i$ signals directly
and a capacity of 890 bits/sec. will transmit the
$x_i$ signals. Thus a saving of a little more than
1/3 of the original channel capacity is achieved
by the matrix.


Certain discrepancies between the observed 
performance of the transmission system and the 
expected performance based on the eigenvalues 
can be attributed to the difference between the 
mean square fidelity criterion and that used by 
the ear. For example, preliminary articulation 
measurements indicate a relatively lower articula- 
tion score for certain short, but important 
sounds, such as stops (p, b, t, etc.). Because
these sounds are of such short duration, their
spectra would not be well represented in the time
average used to evaluate $\overline{e_i e_j}$; thus large errors
could be expected in their transmission.


CONCLUSIONS 


A linear coding scheme has been developed 
for the transmission of n correlated signals over 
m channels. The coding is an optimum way of re- 
ducing the number of transmission channels where 
fidelity is measured by the sum of the mean 
square errors. In addition the coding results in 
a reduction of the sum of the channel capacities 
necessary to send the signals over independent 
channels. 


The increased efficiency of the coding pro- 
cedure may be utilized in two ways in a vocoder. 
Either speech of substantially the same quality 
can be transmitted over fewer channels or higher 
quality speech can be sent over the same number
of channels. 


The coder involves no memory and thus can 
be instrumented very simply with a resistance 
network plus sign inverting amplifiers. This 
simplicity is a practical advantage over a 


more efficient coding scheme which requires more 
complicated instrumentation. 


The minimization is of theoretical interest 
since the mean square error is not a linear 
function. Thus the minimization has been carried 
one step further than in most mean square proce- 
dures, which essentially minimize only linear 
functions. In signal theory terminology this 
result is equivalent to saying that an optimum 
way of both decomposing and recomposing the original
signals has been developed instead of the
usual optimization which includes only an optimum 
recomposition after an arbitrary decomposition. 


Appendix 


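The problem at hand is to minimize the
average error,


$$M = \sum_{i=1}^{n} \overline{\Big(e_i - \sum_{j=1}^{m} \sum_{k=1}^{n} B_{ij} A_{jk} e_k\Big)^2} \tag{17}$$


by appropriately choosing an m x n matrix A and
an n x m matrix B. We begin by determining
matrix coefficients A and B that will result
in a stationary value for M.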


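$$\frac{\partial M}{\partial B_{rs}} = -2 \, \overline{\Big(e_r - \sum_{j} \sum_{k} B_{rj} A_{jk} e_k\Big)\Big(\sum_{l} A_{sl} e_l\Big)} = -2 (AR)_{sr} + 2 (A R A^T B^T)_{sr} \tag{18}$$

$$\frac{\partial M}{\partial A_{rs}} = -2 \, \overline{\sum_{i} \Big(e_i - \sum_{j} \sum_{k} B_{ij} A_{jk} e_k\Big) B_{ir} e_s} = -2 (RB)_{sr} + 2 (R A^T B^T B)_{sr} \tag{19}$$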

Here, R is the matrix of correlation coefficients,
$R_{ij} = \overline{e_i e_j}$, which is symmetric and positive and
may be considered positive definite without loss
of generality. The necessary conditions for
stationarity, stated in matrix notation, are from
Eqs. (18) and (19)


$$AR = A R A^T B^T \tag{20}$$


$$RB = R A^T B^T B \tag{21}$$


With the substitution BA = C, Eqs. (20) and
(21) can be put in the form


$$CR = C R C^T \tag{22}$$


$$RC = R C^T C \tag{23}$$


By transposing both members of Eq. (22) we find
that


$$(CR)^T = (C R C^T)^T = C R C^T = CR$$

and thus

$$R C^T = CR \tag{24}$$

Multiplying Eq. (23) through by $R^{-1}$ and taking
transposes, we find that

$$C = C^T C = C^T \tag{25}$$

Combining (25) with (24) yields

$$RC = CR \tag{26}$$

and Eq. (25) can be written

$$C^2 = C \tag{27}$$

Recalling that the trace of a matrix is the sum of
its diagonal elements, we can write Eq. (17) in
the form


$$M = \operatorname{tr}[(I - C)(I - C^T) R]$$


Since R is symmetric, there exists a nonsingular
matrix U with $U^{-1} = U^T$ such that


$$U^{-1} R U = \operatorname{diag}(L_1, \dots, L_n) \tag{28}$$


is diagonal. Eq. (26) implies in addition that


$$U^{-1} C U = P \tag{29}$$


is also diagonal. Let the rank of C be $k \le m$.
Then the rank of P is also $k \le m$ and


$$P^2 = U^{-1} C U \, U^{-1} C U = U^{-1} C^2 U = U^{-1} C U = P \tag{30}$$

implies that the diagonal elements of P are either
zero or one and therefore there are $k \le m$ ones on
the diagonal of P.




Now

$$M = \operatorname{tr}[(I - C) R] = \operatorname{tr}[U^{-1} (I - C) U \, U^{-1} R U] = {\sum}' L_i \tag{31}$$


where the summation extends over n - k of the
eigenvalues of R. M is therefore minimized by
letting k = m and choosing P so that its
nonvanishing terms correspond to the m largest
eigenvalues of R.


The matrix C that achieves this minimum is
given by


$$C = U P U^{-1} \tag{32}$$

A check will easily show that if $[U]_k$ denotes
the kth column matrix of U,

$$R [U]_k = L_k [U]_k \tag{33}$$


so that $[U]_k$ is an eigenvector of R corresponding
to the eigenvalue $L_k$. The property $U^{-1} = U^T$
implies that $[U]_k$ is normalized. And thus,
finally, the choices


$$A_{ij} = U_{ji}, \qquad i = 1, \dots, m \tag{34}$$

$$B_{ij} = A_{ji} \tag{35}$$


while not yielding unique A and B matrices,
achieve the desired minimum value of M.
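

The properties derived for C can be verified on a small case. The sketch below is illustrative only: the 2 x 2 matrix R, its eigenvector matrix U, and the helper names are assumptions chosen so that the arithmetic is easy to follow. It forms $C = U P U^T$ with a single one on the diagonal of P and checks that C is idempotent, commutes with R, and leaves an error equal to the neglected eigenvalue.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

# Hypothetical correlation matrix with eigenvalues L_1 = 3 and L_2 = 1.
R = [[2.0, 1.0],
     [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
U = [[s, s],
     [s, -s]]        # columns are the normalized eigenvectors of R
P = [[1.0, 0.0],
     [0.0, 0.0]]     # keep only the largest eigenvalue (m = 1)

C = matmul(matmul(U, P), transpose(U))
CC = matmul(C, C)
CR = matmul(C, R)
RC = matmul(R, C)

I_minus_C = [[(1.0 if i == j else 0.0) - C[i][j] for j in range(2)]
             for i in range(2)]
M_min = trace(matmul(I_minus_C, R))

print("C =", C)
print("M =", M_min)
```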


The above proof was first given by the
author under restrictive conditions on A and B.
D. Slepian was able to remove these and the
present proof owes its generality to him.


References 

1. H. Dudley, Remaking Speech, J.A.S.A.,
May 1939.

2. C. E. Shannon, A Mathematical Theory of
Communication, Bell System Technical Journal,
Vol. 27, July, October, 1948.

3. G. Pólya and G. Szegő, Aufgaben und Lehrsätze
aus der Analysis, Dover, N.Y.C., 1945.

4. R. T. Gregory, Computing Eigenvectors and
Eigenvalues of a Symmetric Matrix on the
ILLIAC, Mathematical Tables and Other
Aids to Computation, Vol. 7, 1953,
pages 215-220.


Fig. 1 - Energy signals in ten channel vocoder.


Fig. 2 - Complete transmission system.


Fig. 3 - Representation of channel vocoder.




ON AN APPLICATION OF SEMI GROUP METHODS
TO SOME PROBLEMS IN CODING


By M. P. Schützenberger
(C.N.R.S., Paris)


0. Introduction.


The current paper deals with a chapter in
what could be called communication theory in extensive
form : it starts with extremely restricted
structures and it stops where the canonical
problem of optimization begins. It even ends sooner, for no
full use of the definitions is made and the main
ergodic theorem is stated without proof.


Actually the very nature of the question
under study has commanded these restrictions together
with the architecture of the paper : we give an
abstract model of some sort of language and we try to
show how semi group concepts apply fruitfully to it
with the hope that some of them may be at least of
stimulating interest to specialists working on
natural languages.


As is frequent in the field of cybernetics, the
mathematics involved, even if quite simple, are far
away from classical analysis and, indeed, many of the
necessary tools had to be sharpened especially for
the purpose.


Thus the paper is twofold : in a first part
the model and its main properties are discussed at a
concrete level on the simplest cases : the coding and
decoding with length bounded codes. In a second part
a selection of theorems are proved whenever the
necessary semi group theoretic preliminaries are not
exacting. The link along this tail of appendices is
the theory developed verbally in the first part.
Finally a special chapter provides a bridge toward
probabilistic applications.


It is proper at this place to acknowledge
the contributions of three authors who influenced
deeply the building of the theory :


Sardinas and Patterson(1), who discussed first on a
logical basis the coding process.


B. Mandelbrot(2), who recognised and studied
extensively the role of "word units" in communication
theory and related the problem to Feller's recurrent
events.


P. Dubreil(2) and his school, whose pioneering
work on discrete semi groups has provided many basic
concepts and arguments as it will be seen below.


Part I 
1. Preliminary definition of a discrete semi group
language :
We shall be concerned with the two basic 
sets of communication theory : 


The set of all messages which may possibly be 
sent. 


The set of all signals available for transmission 
along the line. 


The main feature of the theory is the
postulational requirement that the signals as well
as the messages pertain both to some common class of
structures so that coding and decoding not only be
inverse operations but far more generally, be special
instances of a quite broad new process, that of
translation.


This identity of structure between the two
sets itself results from the basic restriction that they
develop homogeneously in time - or more accurately
that both admit a common partial order and
composition operation.


That such requirements are rather stringent
is clearly seen by the example of photography (two
exposures rarely give a result which is, in any sense,
equivalent to a third one) or even by harmonic
modulation where the Fourier transform exchanges so well
time and frequency that finite signals cannot be
fully adequate.


On the other hand, languages either spoken,
written or gesticulated are somewhat akin to our
consideration, and we shall use the name of "discrete
semi group languages" (d.s.g.l.) for naming the
elemental concepts of our study.


The definitions below are quite general and 
as said before, no full use of them will be made here 
- very little gain in simplicity would be achieved by 
using more restrictive ones. 


DEFINITIONS : 


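I. A discrete semi group language will be a
set $\Lambda$ of objects called "messages" satisfying the
following conditions :


I.1. If $\lambda_i$ and $\lambda_j$ pertain to $\Lambda$ so does their
"product" $\lambda_i \lambda_j$, made up of "$\lambda_i$" followed by "$\lambda_j$"
($\lambda_i$ will be said a left divisor and $\lambda_j$ a right
divisor of $\lambda_i \lambda_j$).


I.2. If $\lambda_i$, $\lambda_j$ and $\lambda_k$ pertain to $\Lambda$ and
if $\lambda_l = \lambda_i \lambda_j$ and $\lambda_m = \lambda_j \lambda_k$, then $\lambda_l \lambda_k$ is
identical with $\lambda_i \lambda_m$.


I.3. The "vacuous message" $\phi$ pertains to $\Lambda$
and satisfies $\phi \lambda_i = \lambda_i \phi = \lambda_i$ for all $\lambda_i \in \Lambda$.


I.4. There is a sub set $\Lambda_0$ from $\Lambda$ called
"dictionary" or "basis" whose elements are called
"words". $\Lambda_0$ is such that :


I.4.1. $\phi$ does not pertain to $\Lambda_0$.
I.4.2. for all $\lambda_i \in \Lambda - \phi$,
either $\lambda_i \in \Lambda_0$,
or there exists a unique finite set
of words $\lambda_{i_1}, \lambda_{i_2}, \dots, \lambda_{i_k} \in \Lambda_0$
with $\lambda_i = \lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_k}$.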
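II. Given two d.s.g.l. $\Lambda$ and $M$, a
correspondence $\theta$ between the elements of two subsets
$\Lambda' \subset \Lambda$ and $M' \subset M$ will be said a translation
if it satisfies :


II.1. The correspondence is one to one wherever
it is defined.


II.2. If $\lambda_i \in \Lambda'$, $\theta\lambda_i = \mu_i$, and $\lambda_j \in \Lambda'$, $\theta\lambda_j = \mu_j$,
then $\lambda_i \lambda_j \in \Lambda'$ and $\theta(\lambda_i \lambda_j) = \mu_i \mu_j$.

II.3. The translation will be said :


Total from $\Lambda$ to $M$, if $\Lambda' = \Lambda$.
Subtotal from $\Lambda$ to $M$, if for
all $\lambda_i \in \Lambda$ there is at least a $\lambda_j \in \Lambda$
such that $\lambda_i \lambda_j \in \Lambda'$.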


III. A neat coding of $\Lambda$ into $M$ will
be a translation total from $\Lambda$ to $M$
and subtotal from $M$ to $\Lambda$.


In algebraic form we could reduce our
axiomatic to :


I' : $\Lambda$ is the free discrete semi group
generated by $\Lambda_0$.

II' : A translation is an isomorphism between
the sub semi groups $\Lambda' \subset \Lambda$ and
$M' \subset M$.

III' : A translation is a neat coding if $\Lambda' = \Lambda$
and $M'$ is a subsemigroup of $M$ neat on
the right. (Note that "subsemigroup"
entails I.1, I.2 and I.3 ; "free" corresponds
to unique in I.4.2, "discrete" to
finite at the same place).


2. Practical significance of the axiomatic :


Let us take a simple example in coding :
$\Lambda_0 = \{\lambda_1, \lambda_2, \lambda_3, \lambda_4\}$ ; $M_0 = \{+, -\}$
($M_0$ is the usual binary alphabet; $\Lambda$ is the set
of all strings of a finite number of the "elementary
messages" $\lambda_i$ (i = 1, 2, 3, 4) and $M$ is built in the
same way with the "letters" + and -).


When coding, we want to establish a correspondence
between $\Lambda$ and some subset $M'$ of
$M$ satisfying two conditions :


1) to every $\lambda \in \Lambda$ corresponds at least one $\mu \in M$
("total" character of the coding).


2) to any distinct $\lambda, \lambda' \in \Lambda$ must correspond
distinct $\mu, \mu' \in M'$ in order that the
deciphering be free from ambiguity.


3) A priori any one to one correspondence between $\Lambda$
and a subset $M'$ from $M$ would do - but usually
this could imply that we cannot proceed to the
sending of the message before we know it in its
totality. So a further practical condition -
which is not too easy to formulate rigorously -
could be :

For a reasonably large number of messages $\lambda$ the
coding is such that for any right multiple $\lambda'$ of $\lambda$
(i.e. any $\lambda' = \lambda\lambda''$) the signals $\mu$ and $\mu'$
have a reasonably long common left divisor $\mu_1$
(i.e. are of the form : $\mu = \mu_1 \mu_2$ and $\mu' = \mu_1 \mu_2'$).
The simplest way of fulfilling these desiderata
is to assign to each $\lambda_i \in \Lambda_0$ a string of
binary letters $\mu_i$ (which very conveniently we
may too call a word) and for any sequence
$\lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_k}$ to send the corresponding
sequence $\mu_{i_1} \mu_{i_2} \cdots \mu_{i_k}$.

For example with the correspondence :
Most Peps Ag ree =e Ne eee 
Ay SETS EPS 

we would have :


Dyas Mae ag ot ore eee 


It is not obvious however how the set $M'_0$ of the
words $\mu_i$ has to be selected so that decoding
be free from ambiguity :


To my knowledge, the question has been raised first
and practically solved by Sardinas and Patterson
in a pioneering paper(1).


With the help of semi group concepts we may however
obtain a deeper insight into their whole
procedure which was purely logical :


We are looking for a total translation from $\Lambda$
to $M$ and it is quite axiomatic that the decoding
is unambiguous if and only if the sub semi group
$M'$ generated by $M'_0$ is isomorphic to the free
semigroup $\Lambda$ - or - for short - that $M'$ is a
free subsemigroup of $M$.


Algebraic consequences of this simple remark are
to be found in appendix 1.
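

For a finite set of words, freeness of the generated sub semi group (unambiguous decoding) can be tested mechanically. The sketch below implements the test commonly known as the Sardinas-Patterson procedure in its usual "dangling suffix" form; it is offered as an illustration and is not the semi group formulation developed in this paper. The function names and the example codes are assumptions.

```python
def dangling(a_set, b_set):
    """Non-empty strings w such that a + w = b for some a in a_set, b in b_set."""
    out = set()
    for a in a_set:
        for b in b_set:
            if len(b) > len(a) and b.startswith(a):
                out.add(b[len(a):])
    return out

def uniquely_decodable(words):
    """True if every string has at most one factorization into the given words."""
    code = set(words)
    current = dangling(code, code)     # suffixes left when one word begins another
    seen = set()
    while current:
        if current & code:             # a dangling suffix is itself a word
            return False
        frozen = frozenset(current)
        if frozen in seen:             # the sets repeat: no ambiguity will appear
            return True
        seen.add(frozen)
        current = dangling(code, current) | dangling(current, code)
    return True

# Hypothetical example codes over the letters + and -.
print(uniquely_decodable(["+", "+-", "--"]))  # True
print(uniquely_decodable(["+", "+-", "-"]))   # False: "+-" is also "+" then "-"
```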


Now would come a fourth requirement : (admissibility)


4) The length of the words $\mu_i$ must be as small
as possible in respect of some a priori
probability distribution on $\Lambda$.


As a matter of fact (4) will be met incidentally, so
to say, in view of another condition we put in
definition III :


That the translation from $M$ back to $\Lambda$
be sub total :


What this means exactly is that any sequence $\mu$ of
binary digits be a left divisor of at least one
message $\mu' \in M'$ which can be completely and exactly
retranslated into $\Lambda$.


This condition together with the possibility of one- 
to-one deciphering implies automatically that the 
code be unitary (as defined below)(see appendix 0), 
and admissible in that sense that it meets the 
optimality requirement (4) in respect of at least one 
a priori probability distribution of the words. (*) 


3. Discussion of the decoding methods : scansion


This being settled we have to look more 
closely at the decoding. 


For avoiding repetition let us observe that
$\Lambda$ does not play any role by itself since the
$\lambda_i \in \Lambda_0$ are in a one-to-one correspondence with
the words $\mu_i \in M'_0$. So we may perfectly well
dispense with mentioning it altogether.


But in order to stress when a given string
$\mu$ of binary symbols is really a set made up of a
sequence of words and not any odd sequence of + and
- we shall say that $\mu$ is a complete message (for


instance : "i+--+- "= p: 3 isa complete 
message, but "*+-- " is not) and indicate it by 
enclosing it into two signs, which shall denote 
too, end and beginning of the words. 


Let us try to decode the following complete


message in code Ey: 
Jet--+--+--+] 
The only way open is trial and error : the first + 
may be: 
- either 4, itself 
- either the first letter from p, = /+i-j 


so that we have the choice between ; 


Jefe--te-ro-F] ana [+e-|-e--+--#/ 

In the first case no further doubt comes in and we
are led to :

fefri--[rl--l+F le! = MMi Khe Mi My pi Ma Ky 


(*) If $M$ is the free semi group of all phonemic
sequences in English and $M'$ the sub set of all
"semantically correct sentences", $M'$ is neat
in $M$ ; for instance :
"/pri wat law cut chur coco feet .." (obtained
from King Lear, Act III, scene I, with Tippet's
help) is fitted into a complete message in $M'$ by
adding : "... and this, Gentlemen, was, may be, my
best example of a semantically void utterance".




In the second we obtain : 
as eet ad bah al i cael Oat 2G 

Since here -+ is left at loose end (strictly speaking)
the first translation was the good one, it being known
that the transmission is over. Observe that if, on the
contrary, the signal was the same as before except for
an added terminal - digit, the conclusion would be
exactly opposite ;


jke-J-r-f-#-]-+-] 
is the only fitting "scansion" as we could say by
borrowing from prosody this term for its classical
flavour.
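

The trial and error search for scansions can be written as a short recursive routine. The sketch below is illustrative: the function name and the three-word code are assumptions (they are not the code used in the text), chosen because this code shows the same effect, namely that one added terminal - digit reverses the choice of the very first word.

```python
def scansions(message, words):
    """All ways of cutting `message` into a sequence of code words."""
    if message == "":
        return [[]]
    found = []
    for w in words:
        if message.startswith(w):
            for rest in scansions(message[len(w):], words):
                found.append([w] + rest)
    return found

code = ["+", "+-", "--"]            # hypothetical code, not the one in the text
print(scansions("+----", code))     # [['+', '--', '--']]
print(scansions("+-----", code))    # [['+-', '--', '--']]
```

With this assumed code the first word cannot be settled until the end of the run of - digits is seen, so decoding may have to wait for the end of the transmission.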


So the inverse translation from $M$ back to
$\Lambda$ does not look like satisfying very reasonably
the above condition 3.


An obvious remedy to it would be to limit
still more the set $M'$. B. Mandelbrot, who has
first discussed these problems, has distinguished
several possibilities :


1) Uniform codes : in which every word has the same
length (i.e. number of letters) ; this criterion
gives a direct scansion (examples : all the noise
reducing codes introduced so far except for a
proposal of "sequential coding" by Peter Elias()
and some examples by Laemmel(5)).


2) More generally : what we shall call : 


Unitary codes : i.e. codes in which no word is a
left divisor of another word (examples : Fano's,
Huffman's, Shannon's codes)


3) Natural codes : (introduced by B. Mandelbrot) in 
which a special letter points out the end of the 
word (example : most of the spoken or written 
languages). 


Further, Mandelbrot has shown that any unitary
code is, at least asymptotically, as good from the
point of view of economy of length as any other
one. It could seem futile then to care for more
extensive classes were we not prompted by other
circumstances - and especially by the threat of
noise.
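

The unitary property and the direct scansion it permits can be sketched in a few lines. The function names and the example code below are assumptions for illustration. A word is emitted as soon as it is recognized, with no need to look ahead.

```python
def is_unitary(words):
    """True if no word is a left divisor (proper prefix) of another word."""
    return not any(a != b and b.startswith(a) for a in words for b in words)

def decode_unitary(message, words):
    """Scan left to right, emitting each word as soon as it is complete.
    Returns the words found and any unfinished remainder."""
    out, buf = [], ""
    for letter in message:
        buf += letter
        if buf in words:
            out.append(buf)
            buf = ""
    return out, buf

code = ["+", "-+", "--"]                 # hypothetical unitary code
print(is_unitary(code))                  # True
print(is_unitary(["+", "+-", "--"]))     # False: "+" begins "+-"
print(decode_unitary("-+--+", code))     # (['-+', '--', '+'], '')
```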


4. Noise absorption and ergodism.


“Consider indeed the following code: (6, 
Wis to byacth 5 Ma ects page= 
(which is, parenthetically, just the previous one 
with the time arrow inverted) 


It is unitary all right so that we may 
represent it by a "tree" in the familiar fashion : 


(south west arrow : + ; south east arrow : -)


The "neat" condition (subtotality of the translation
from $M$ back to $\Lambda$) is reflecting
itself in the fact that any branch of the tree
ends with a word (for example the code $\mu_1$ = +,
$\mu_2$ = -++, $\mu_3$ = -- would not be neat since no word
nor sequence of words may begin with -+-...).


Suppose that we have to decode the sequence : 


/- SE Se a 
we obtain directly : 

[-+-[-e-lej--tel 7] 
and we could have written it down extemporaneously
without waiting for the end of the transmission.


But if the first digit had been blurred by
noise, this straightforward attitude could not be
kept : indeed we decipher the uncertain message


(ieee, pe ire 
either as : 
lele|--lelmpcho+<l= ec 


or as above :


ies aire eer) ak 
and as long as the message is going on we have no
evidence for deciding between these two
interpretations. Things nonetheless are not so bad
as they look at first glance :


Suppose that the next letters which appear 
be Foe gS ES i ins 


so that up to this time the two alternative versions 


Tees 
2 ili 258 rb acl Ml IRL (ig PARE MEIER PG gen cep i 
j- + fm & -f4/—- -Jsj- -[tli+g- + kA el 


By the seemingly fortuitous fact that in both cases
the end of a word falls exactly at the same spot
(marked // above), the two translations coincide
from this point on and since one of them must be
right so is the end of the deciphering, assuming of
course that no new error of transmission takes place.
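
The coincidence of the two scansions can be reproduced on a small case. The sketch assumes the words +, --, -+-, -++ for the example code; the messages are ours, chosen so that only the first letter differs.

```python
def marks(message, words):
    """Positions (counted in letters) at which a scansion mark falls."""
    words, current, found = set(words), "", set()
    for position, letter in enumerate(message, start=1):
        current += letter
        if current in words:
            found.add(position)
            current = ""
    return found


G = ["+", "--", "-+-", "-++"]     # assumed words of the example code
sent = "-+---++"                  # scanned as /-+-/--/+/+/
received = "++---++"              # same message, first letter altered by noise
common = sorted(marks(sent, G) & marks(received, G))
print(common)
```

From the first common mark on, both readings are at the start of a word together, so they coincide as long as no new error occurs.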


Practically, if such a fact was frequent
enough, this would mean that for very low levels of
noise, considerable parts of the "meaning" could be
preserved. We shall see that this ergodic property
(i.e. this relative independence, for long sequences,
of the scansion of the end from that of the beginning)
is the rule rather than the exception.


More specifically, for neat codes whose
words all have a bounded length, and apart from three
exceptional families, there is at least one finite
sequence of words, say μ∞, such that whatever be the
initial sequence α, αμ∞ is a complete message.
This implies that, when decoding, any blurring or
error in α is "absorbed" by μ∞ and that from
the end of μ∞ on, the scansion starts all right
afresh.
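
One possible way of looking for such an absorbing sequence is a breadth first search over sets of tree nodes. This is a sketch under assumptions, not the paper's method: the code is supposed unitary, the function name is ours, and the words of G are our reading of the example.

```python
from collections import deque


def find_absorbing_word(words, alphabet="+-"):
    """Shortest letter string sending every node of the tree to the root, or None."""
    words = set(words)
    nodes = {w[:i] for w in words for i in range(len(w))}
    root = frozenset({""})

    def step(node, letter):
        nxt = node + letter
        if nxt in words:
            return ""                       # a word ends: back to the root
        return nxt if nxt in nodes else None  # None: no word can continue (not neat)

    start = frozenset(nodes)
    queue, seen = deque([(start, "")]), {start}
    while queue:
        states, word = queue.popleft()
        for letter in alphabet:
            image = frozenset(step(s, letter) for s in states)
            if None in image:
                continue
            if image == root:
                return word + letter
            if image not in seen:
                seen.add(image)
                queue.append((image, word + letter))
    return None


G = ["+", "--", "-+-", "-++"]              # assumed words of the example code
print(find_absorbing_word(G))
print(find_absorbing_word(["++", "+-", "-+", "--"]))
```

For the assumed code the search returns a sequence of two words; for the uniform code of length two it finds nothing, in line with the exceptional families discussed later.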


Now if the words are given randomly and
independently with fixed probabilities, it is clear
that the probability for a given sequence not to
contain μ∞ tends with its length exponentially to
zero so that any initial error is most likely to
have only limited effects.
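
The exponential decrease can be made concrete by a crude bound: cut a sequence of n words into disjoint blocks of k words, each of which equals the absorbing sequence with probability q, independently of the others. The names and the numerical values below are assumptions chosen for illustration only.

```python
def not_absorbed_bound(q, k, n):
    """Upper bound on the probability that n independent words contain
    no copy of a given block of k words occurring with probability q."""
    return (1.0 - q) ** (n // k)


# Assumed example: a block of two words, each of probability 1/2, so q = 1/4.
for n in (10, 20, 40):
    print(n, not_absorbed_bound(0.25, 2, n))
```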




5. Syntactic equivalence and the fundamental
semi groups.


Suppose we be given, in code G, the following
fragment μ from a message :


a ee ee ee ae aa


By trial and error we see that only three scansions
can possibly be fitted to it :


1) ...jef--f4]--jt)-e-f- 
2): san) ifieis fa pocmpeges oe ieee 
ee +-[ - t -[- te pe] Fy pathy 


In the same manner the fragment 


UA joa Ke Re i sine 
would give alternatives : 
1) Sel-e= | ep one 
2) - tf-e-[efe oe 
3). Se t-[ef-e-fone 
Disregarding the "meaning" of μ and μ'
(i.e. their eventual decoding into the Λ language)
we may observe that "functionally", so to say, μ
and μ' are quite similar :

If the complete message is μ1 μ μ2, the
only possibilities are, for each of the three
scansions :

1) μ1 is a complete message and μ2 starts with
-/.. or +-/.. or ++/.. (so as to make
use of the /-.. left at the end of μ).

2) μ1 is ending by ../-+ (so as to use ..+/)
and μ2 starts as above.

3) μ1 ends with ../- (for the sake of ...+-/)
and μ2 is a complete message.


Easy check shows that the same applies
exactly to μ' and we shall say that μ and μ'
are syntactically equivalent (μ ≡ μ').
Actually both are equivalent to an even simpler
fragment :


μ'' = +-
since this last one admits the same scansions :
1) ../+/-.. ; 2) ..+/-.. ; 3) ..+-/..


It is interesting to observe that syntactic
equivalence has a direct application to normal
linguistics :


If M' is the set of all sentences grammatically
correct :


μ1 ≡ μ2 (approximately !) if and only if μ1
and μ2 pertain to the same grammatical category
(for instance in English : both "adjectives", or both
"verbs at the third person of the present" etc.)


Now the key point is that for any four
finite fragments μ1, μ2, μ3 and μ4 :


μ1 ≡ μ2 and μ3 ≡ μ4 implies μ1μ3 ≡ μ2μ4.


The syntactic equivalence is thus fully
compatible with the semi group structure of M and
if we consider classes for ≡ (i.e. the subsets
of elements from M which are syntactically equivalent
between themselves), these classes make a new semi
group which is an homomorphic image of M.


This image of M, the fundamental semi group of the
coding (f.s.g.), is most usually finite and is easily
represented by matrices, but before we explain how,
we need still a new concept : that of prefix :


Consider again two fragments μ and μ'
but assume, now, that both are beginning at a / mark :


Even if μ and μ' are not syntactically
equivalent, it could happen that under this
supplementary restriction any further fragment which
completes μ into a full message would do the same
to μ'.


One could say that "μ and μ' as beginnings
of messages are syntactically equivalent on the
right" (in symbols : μ ~ μ').
For example : 
w:|--- and i oa j Vice not 
So ld [oetiad it Vk) 
in the relation = (since++M*: isa 
complete message although /—+h'=/++i/-... is 


not complete), but μ ~ μ' all the same, for μμ''
is a complete message if and only if μ'' starts
with -/.. or ++/.. or +-/.., just as well as for μ'.


We call prefixes the classes Πi of
fragments for this new relation ~.


For the code G there are three prefixes :


Π1 ∋ / (words and words only are bringing a μ ∈ Π1
into a complete message. Π1 contains all the words
and its existence is typical of unitary coding).
Π2 ∋ /-... (the corresponding right divisors
are -/.. , +-/.. and ++/..).
Π3 ∋ /-+... (the corresponding right divisors
are +/.. or -/..).

Now if μ1 ~ μ2, one proves that
μ1μ3 ~ μ2μ3, too, whatever be μ3.


With unitary codes prefixes correspond to
nodes of the tree, two nodes being in relation ~
("corresponding to the same prefix") if the
subtrees below them are identical. Such things do not
occur in our G code (see below), but are quite
typical of uniform codes.




In the uniform code of length 2 (words : ++, +-,
-+, --) there are only two prefixes :
one, Π1, corresponding to complete messages, i.e.
to sequences with an even number of letters, and
another one, Π2, corresponding to odd length
sequences.
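
For unitary codes the prefixes can be counted by grouping the nodes of the tree whose subtrees are identical. The sketch below does this by comparing, for each node, the set of endings that complete it into a word. The function name is ours and the words of G are our assumed reading of the example code.

```python
def prefixes(words):
    """Group the nodes of the tree of a unitary code by identical subtrees."""
    words = set(words)
    nodes = {w[:i] for w in words for i in range(len(w))}
    classes = {}
    for node in nodes:
        # The subtree below a node is described by the endings that close a word.
        subtree = frozenset(w[len(node):] for w in words if w.startswith(node))
        classes.setdefault(subtree, []).append(node)
    return sorted(sorted(group) for group in classes.values())


G = ["+", "--", "-+-", "-++"]              # assumed words of the example code
UNIFORM = ["++", "+-", "-+", "--"]
print(prefixes(G))        # three classes, one node each
print(prefixes(UNIFORM))  # two classes: the root, and the two nodes of depth one
```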


6. Matrix representation of the fundamental 
semi group. 


If we have started reading just at the
beginning of the transmission, we may consider at any
time t the prefix Π(t) to which pertains the
initial fragment till the t-th letter as a "state"
which changes at any new letter received.


For instance - apart from any meaning again = 
the sequence |; -- = +---.corresponds to the following 
sequence of prefixes : 

i, Th, Te ThSigstaat, Th, Tp 


It is easy to visualise "+" and "-"
respectively as the transition matrices :


        Π1 Π2 Π3                Π1 Π2 Π3
   Π1    1  0  0           Π1    0  1  0
   Π2    0  0  1           Π2    1  0  0
   Π3    1  0  0           Π3    1  0  0
          (+)                     (-)


(+ lets Π1 invariant since it is a word. It sends
Π2 into Π3 and makes a word from Π3 ; etc.)
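
The transition matrices can be computed from the words themselves. This is a sketch under assumptions: the words of G and the node representatives chosen for Π1, Π2, Π3 are our reading of the example, and the function names are hypothetical.

```python
def transition_matrix(letter, words, states):
    """Row i has a 1 in the column of the state reached from states[i] by the letter."""
    words = set(words)
    matrix = [[0] * len(states) for _ in states]
    for i, state in enumerate(states):
        nxt = state + letter
        # A completed word leads back to the first state (the root of the tree);
        # states.index raises ValueError if the code is not neat.
        j = 0 if nxt in words else states.index(nxt)
        matrix[i][j] = 1
    return matrix


def multiply(a, b):
    """Usual line by column product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


G = ["+", "--", "-+-", "-++"]      # assumed words of the example code
STATES = ["", "-", "-+"]           # representatives of Π1, Π2, Π3
PLUS = transition_matrix("+", G, STATES)
MINUS = transition_matrix("-", G, STATES)
PLUS_MINUS = multiply(PLUS, MINUS)  # matrix of the fragment +-
print(PLUS, MINUS, PLUS_MINUS)
```

The product of the matrices of two fragments, taken in reading order, is the matrix of the fragment obtained by writing one after the other.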


These matrices correspond in a one to one fashion to
the elements of the fundamental semi group, for instance :


(with the usual line by column multiplication) is the
matrix given below. What are the matrices corresponding
to complete messages ? In the general case they
are the matrices of the subsemigroup which is the
image of M' by the syntactic homomorphism.


But if the code is unitary, M' is characterised very
nicely, since μ ∈ M' implies that μ sends Π1 into
itself : M' is just the set of the matrices of the
f.s.g. with 1 in the top left corner.
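
The whole f.s.g. can be generated by closing the letter transitions under composition. The sketch below stores each matrix as the tuple of the states reached from Π1, Π2, Π3. It assumes a unitary neat code in which distinct nodes are distinct prefixes (as in our assumed reading of G), and the names are ours.

```python
def letter_maps(words, alphabet="+-"):
    """For each letter, the tuple of states reached from each node (root is 0)."""
    words = set(words)
    nodes = sorted({w[:i] for w in words for i in range(len(w))},
                   key=lambda n: (len(n), n))
    index = {n: i for i, n in enumerate(nodes)}
    return {a: tuple(0 if n + a in words else index[n + a] for n in nodes)
            for a in alphabet}


def fundamental_semigroup(maps):
    """All transformations obtained by reading finite non-empty fragments."""
    elements = set(maps.values())
    frontier = list(elements)
    while frontier:
        t = frontier.pop()
        for g in maps.values():
            product = tuple(g[s] for s in t)   # read the fragment t, then one letter
            if product not in elements:
                elements.add(product)
                frontier.append(product)
    return elements


G = ["+", "--", "-+-", "-++"]                  # assumed words of the example code
maps = letter_maps(G)
fsg = fundamental_semigroup(maps)
complete = {t for t in fsg if t[0] == 0}       # 1 in the top left corner
print(len(fsg), sorted(complete))
```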

Further, noise absorption - or ergodic - 
properties reflect themselves quite directly on this 
matrix representation. 

Suppose that the correct message μ...
and the perturbed message μ'... fall back both
at this very time on a common scansion mark ; if
the prefix corresponding to μ was Π and that
corresponding to μ' was Π', this would mean
that the next signals send both Π and Π'
to Π1.


On the matrices this is expressed by the
fact that in column Π1 there are two 1's : one in
the line Π and another one in line Π'. In particular
μ∞ is a matrix with 1 everywhere in column Π1.


But this in turn is linked closely with the fact that
the f.s.g. is a semi group and not a group (whose
matrices should all have a single 1 by column).


Consider as a counter example the uniform
code with four words. Its f.s.g. is just the cyclic
group of order two, made up of the two elements :


        Π1 Π2
   Π1    0  1      (+ or - or any odd length
   Π2    1  0       sequence)


        Π1 Π2
   Π1    1  0      (++ or +- etc.,
   Π2    0  1       or any even length sequence).


No real absorption takes place for indeed
if we had missed the first letter of the transmission
and started wrongly scanning from the second letter,
the error will obviously go on as long as does the
message.


As a matter of fact uniform codes are the
only neat codes with a bounded length for words whose
f.s.g. is a group. They are the first exceptional non
ergodic family.
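
Following the remark that the matrices of a group have a single 1 in each column, one may test whether every letter permutes the prefixes. This is only a sketch of that criterion, for unitary neat codes, with hypothetical names and with our assumed reading of the code G.

```python
def letter_actions(words, alphabet="+-"):
    """Action of each letter on the prefixes (nodes with identical subtrees merged)."""
    words = set(words)
    nodes = {w[:i] for w in words for i in range(len(w))}
    subtree = {n: frozenset(w[len(n):] for w in words if w.startswith(n))
               for n in nodes}

    def target(node, letter):
        nxt = node + letter
        return subtree[""] if nxt in words else subtree[nxt]

    return {a: {subtree[n]: target(n, a) for n in nodes} for a in alphabet}


def letters_permute_prefixes(words, alphabet="+-"):
    """True when each letter sends distinct prefixes to distinct prefixes."""
    actions = letter_actions(words, alphabet)
    return all(len(set(action.values())) == len(action)
               for action in actions.values())


print(letters_permute_prefixes(["++", "+-", "-+", "--"]))   # uniform code
print(letters_permute_prefixes(["+", "--", "-+-", "-++"]))  # assumed code G
```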


7. Super coding.


We have given a very general definition of
"translation" which suggests the possibility of more
complex processes involving not only two but several
languages. In the general case, things are a bit
confused and we shall restrict ourselves to Unitary
Neat Coding from K into Λ and from Λ into M.




Suppose for instance that we have the
following set up :


K is a d.s.g.l. with words Ki (1 ≤ i ≤ 7)


Λ is a d.s.g.l. with words λi (1 ≤ i ≤ 4)


M is our familiar binary d.s.g.l.


Each word of /\. is coded inM as in example 2: 
eaas oi Vie) aa rpc Ye Se 

Each word of K is coded by the following 

sequences \) orf SA (for clarity we use upper 

and lower indices) : Ki A hs A, Keo Meda, sar gaan 
K,—> A"= A325 Kg 7 Mz dgd3 3) Ke Pas ds an, hr 9 Ws da, 


This coding is unitary and neat all right and
corresponds to the tree :


(Figure : tree of the K → Λ coding.)


Now there is again a coding of K into M when every λ
is written in binary alphabet :
«> Uy K> 7 Sier K, -? Beeat)
Ky cece 


Ks DN Met oes RP ts eh II) Oi mee GMS IS 


It is not difficult to see that this K → M
coding is unitary and neat. Its tree is given below.
Since we know the importance of fundamental semi
groups we would be interested to get at once that of
the K → M process from the other two (for K → Λ and
for Λ → M) or, alternatively, to know the relation
between the syntactic equivalences on the bottom
structure M : ≡ (Λ) in respect of Λ → M, without K
appearing in the picture, and ≡ (K) in respect of
K → M, with Λ put off from the circuit.


The main result is that :


μ ≡ μ' (K) entails μ ≡ μ' (Λ)
or, if one prefers, that the f.s.g. attached to Λ is
a homomorphic image of the one attached to K.


This is rather convenient from a technical
point of view for it allows what is called a
filtering. If, starting from the assumption that the
λi are provided independently with fixed
probabilities by the source, we discover later on
that, actually, they were just building blocks in
some higher degree semantic units (sent again
independently of each other as a second
approximation), we can preserve at least some of the
features of our initial approximation.


But the main point for us here lies in another
aspect.


Suppose that the K → Λ coding be uniform.


In general the K → M one will not be so, but it will
fail to be ergodic just the same, giving us the
second of the three exceptional families mentioned
above. We shall call such codes "uniformly composed
codes". An example is given below :


K → Λ : uniform of length two
Λ → M : our usual G


The nodes indicated with a o are the ones
corresponding to nodes in the K → Λ coding.
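
The failure of absorption for a uniformly composed code can be observed by composing a uniform code of length two with the code G and searching for an absorbing sequence as before. This is a sketch under assumptions: the words of G are our reading of the example, and the search routine is ours.

```python
from collections import deque


def find_absorbing_word(words, alphabet="+-"):
    """Shortest letter string sending every node of the tree to the root, or None."""
    words = set(words)
    nodes = {w[:i] for w in words for i in range(len(w))}
    root = frozenset({""})

    def step(node, letter):
        nxt = node + letter
        if nxt in words:
            return ""
        return nxt if nxt in nodes else None

    start = frozenset(nodes)
    queue, seen = deque([(start, "")]), {start}
    while queue:
        states, word = queue.popleft()
        for letter in alphabet:
            image = frozenset(step(s, letter) for s in states)
            if None in image:
                continue
            if image == root:
                return word + letter
            if image not in seen:
                seen.add(image)
                queue.append((image, word + letter))
    return None


G = ["+", "--", "-+-", "-++"]               # assumed words of the code G
COMPOSED = [a + b for a in G for b in G]    # every K word is a pair of Λ words
print(find_absorbing_word(G))
print(find_absorbing_word(COMPOSED))
```

A reader who is one G word out of step stays one G word out of step whatever is received, which is why the search on the composed code comes back empty.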


8. Anagrammatic codes. 


Let us come now to the last family. For
this we produce the following horrible example :


anette, A ate tae pA Se i 
me 


Fel estes 0 lca? cc 


tar, US Le A pe (al 2m i OR a Pye 


This code is not uniform, nor composed uniformly of
a smaller code. But it has the property that by
inverting its words we find again a unitary code
and, indeed, its symmetric image (symmetric in
respect of the N.S. line).


Since ergodicity is somewhat synonymous with
irreversibility of time, we are put on the alert by
this oddity.



Indeed, absorption is linked very closely
with the problem of reading "backward" messages with
an inverted code, but, without entering this amusing
theory, we can see at once that this code and all its
family are not ergodic.


If a code is unitary the only sequences which
let Π1 invariant are the complete messages, whose
set is M'. In symbols, this means :


μ1μ2 ∈ M' and μ1 ∈ M' implies μ2 ∈ M'.


Suppose now that the same property be true
in the other direction, i.e. that we had :
μ1μ2 ∈ M' and μ2 ∈ M' implies μ1 ∈ M'.


Let μ1 be a complete message which is the
unperturbed beginning of the transmission ; μ1'
its noise corrupted form and μ2 any other complete
message. By the above condition μ1'μ2 may have a
final scansion like that of μ1μ2 if and only if
μ1' is a complete message, too.

As this is usually not the case the error
will go on till the end.


Codes which are unitary for both directions
of time (anagrammatic codes) are not yet fully
explored but a construction for various infinite
families of them is known. With binary alphabet,
there is just the one given above and its symmetric
for less than 16 words. It is conjectured that
there are still no more than 38 other ones below 32
words (on about 10! distinct usual unitary neat
codes of this size or less!).
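
Whether a given list of words is unitary for both directions of time is easy to test. The sketch below is an illustration with hypothetical names; it does not reproduce the (illegible) example of the paper and uses our assumed reading of G only as a code that fails the test.

```python
def unitary_on_the_left(words):
    """No word is a left divisor of another word."""
    return not any(a != b and b.startswith(a) for a in words for b in words)


def unitary_on_the_right(words):
    """No word is a right divisor of another word."""
    return not any(a != b and b.endswith(a) for a in words for b in words)


def unitary_both_ways(words):
    return unitary_on_the_left(words) and unitary_on_the_right(words)


G = ["+", "--", "-+-", "-++"]                     # assumed words of the code G
print(unitary_both_ways(G))                       # + is a right divisor of -++
print(unitary_both_ways(["++", "+-", "-+", "--"]))  # uniform codes pass trivially
```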


So the family is really exceptionally
interesting and deserves further studies since with
the uniform and the uniformly composed codes,
anagrammatic codes are the only length-bounded codes
escaping ergodicity.




References. 


1. A. Sardinas and Patterson.
1953. Convention Records of the I.R.E.
2. B. Mandelbrot.
1953. Proc. Symp. Comm. Theory. London.
1954. Proc. Symp. Inf. Network.
1955. Proc. Symp. Comm. Theory. London.
3. P. Dubreil.
1941. Mem. Acad. Sci. p. 1-52.
1951. Rendiconti di Math. (10) p. 183-200.
1953. Bull. Soc. Math. 81 p. 289-306.
4. P. Elias.
1955. Proc. Symp. Comm. Theory. London.
5. A. E. Laemmel.
1953. ibid.


Part II

Appendix 0 : Some notations for semi group concepts :

A being any semi group, capital letters will represent
subsets of it, small letters being reserved to elements
of A. It will be understood that :
XY = Z means that Z is the set of all elements
z = xy in A obtained as product of an x ∈ X by
a y ∈ Y.

Residuals :
x and y being two elements of A, the notation
x⁻¹y (resp. y x⁻¹) denotes the (eventually vacuous)
set of all those elements z which satisfy xz = y
(resp. zx = y). z is called a "residual". Various
notations for it are to be found in the literature.
X and Y being two sets :
X⁻¹Y is the set Z of all z such as xz ∈ Y
for at least one x ∈ X.
X⁽⁻¹⁾Y is the set of all z such as xz ∈ Y
for all x ∈ X.
The same holds mutatis mutandis for YX⁻¹ and YX⁽⁻¹⁾.

Unitary and neat sub semi group.
These two fundamental notions are due to P. Dubreil
(Bull. Soc. Math. 81, 1953, p. 289-306).

X is a sub semi group of A if and only if : XX ⊂ X.

X is unitary on the left (right) if and only if :
xy ∈ X and x ∈ X (y ∈ X) implies y ∈ X (x ∈ X).
In symbols : X⁻¹X ⊂ X (on the right : XX⁻¹ ⊂ X).

X is neat on the right (left) if for all x there
is at least one y with xy ∈ X (yx ∈ X).


Appendix I. Generalized Sardinas and Patterson's
theorem.

Let M be the free semi group generated by
the letters ; Q its sub semi group generated
by the words, these being for the sake
of the demonstration any finite strings of letters.

Theorem : A n.a.s.c. for Q to be isomorphic
with a free semi group is that for any three
elements x, y, z :
x ∈ Q ; z, xy ∈ Q ; yz ∈ Q imply y ∈ Q.
In symbols : Q⁻¹Q ∩ QQ⁻¹ ⊂ Q.

Proof :
Let u = a1 a2 ... an be a finite string
of letters pertaining to Q. Write u(i,j) as an
abbreviation for the subsequence ai ... aj
obtained from u by amputation.
Call "critical indice" of u any indice i such as
u(1,i) and u(i,n) pertain both to Q and let Ju be the
set of the critical indices of u.
We prove first :
If i and i' ∈ Ju and j ∈ Ju(i,i') then j ∈ Ju.
Indeed, by hypothesis, u(1,i), u(i,j), u(j,i')
and u(i',n) pertain all to Q so that the same is true
of u(1,j) and u(j,n).
Now :
A n.a.s.c. for Q to be a free semi group is that
[...]

The condition is sufficient for : (1) by the above
lemma, the u(ik,ik+1) cannot be broken further and
thence are words ; (2) there is no alternative scansion
[...] critical indice i' not contained in Ju.

The condition is necessary. Suppose [...].
Since j = 1 and j' = n fulfill (1) and (2) below,
there exist j, j' ∈ Ju such as :
(1) [...]
(2) u(j,i), [...] are all in Q
(3) |j' - j| be a minimum.
c = u(j,j') admits at least two different scansions :
one by chopping of u(j,i') and u(i',j') into words ; the
other one chopping of u(j,i) and u(i,j'). These are
different for if u(j,i) and u(j,i'), for instance, had a
common complete message u(j,j'') as left divisor, the
couple (j'', j') would satisfy (1) and (2) above and
(j, j') would not be minimal.

The demonstration is finished for it is
enough now to take x = [...], y = [...] and z = [...]
for obtaining the terms of the theorem.
Remark I. If Q is (left or right) unitary it
corresponds automatically to a code, for Q⁻¹Q ⊂ Q
for instance implies obviously the condition of the
theorem.


Infinite families of codes which are unitary
neither on the left nor on the right may be
constructed but no example is known yet in which
all words have a bounded length.


Remark II. Refinement of Sardinas and Patterson's 
method leads to an important result which we do not 
prove here ; 


A n.a.s.c. for the existence of a fixed finite
number L < ∞ such that the knowledge of the m + L
first letters of a message allows always an
unambiguous decoding of the first m letters,
whatever be m, is that the code be bounded and unitary
on the left.
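
For comparison, the classical test of Sardinas and Patterson (the usual textbook form, not the generalized theorem of this appendix) can be sketched as follows. It decides whether a finite set of words generates a free semi group by following the "dangling" endings left when one word is a left divisor of another. The function name and the test codes are ours.

```python
def is_uniquely_decodable(words):
    """Classical Sardinas-Patterson test on a finite set of distinct words."""
    words = set(words)

    def residuals(left, right):
        # Non-empty endings r with a + r = b, for a in left and b in right.
        return {b[len(a):] for a in left for b in right
                if b.startswith(a) and len(b) > len(a)}

    current = residuals(words, words)
    seen = set()
    while current:
        if current & words:          # a dangling ending is itself a word: ambiguity
            return False
        key = frozenset(current)
        if key in seen:              # the sets of endings repeat: no ambiguity arises
            return True
        seen.add(key)
        current = residuals(words, current) | residuals(current, words)
    return True


print(is_uniquely_decodable(["+", "--", "-+-", "-++"]))  # unitary on the left
print(is_uniquely_decodable(["+", "+-", "--"]))          # unitary on the right only
print(is_uniquely_decodable(["+", "+-", "-+"]))          # +-+ has two scansions
```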


Appendix II. Syntactic equivalence and related 
concepts. 


The notion of syntactic equivalence had
been already met in 1951 (C.R. Acad. Sci. 232,
p. 1987-1989) by M. Teissier working on abstract
semigroups, but this concept does not seem to have
received yet the attention it deserves.


Actually, we shall state the main results
for a slightly broader relation > (K) defined in any
semi group A in respect of any subset K ≠ ∅ of A.


II.1. a > b (K), if and only if, whenever xby ∈ K
then : xay ∈ K.
In symbols : 
a>b (K) Sahipear Sa PPh 7 2 as ae 


A 


We prove, now, the following algebraic
properties of > :


II.2. a > a (K) (obvious).


II.3. a > b (K) and b > c (K) entail : a > c (K)
(obvious).


II.4. a > b (K) and c > d (K) entail : ac > bd (K)
(if II.1 is true for all x, y ∈ A, it is
still true for all x ∈ Au and y ∈ vA ; thus,
owing to associativity :
a > b (K) entails : uav > ubv (K), whatever
be u, v ∈ A.
In particular, if c > d (K), one has :
ac > bc (K) and bc > bd (K)
i.e. : ac > bd (K), in view of II.3).


II.5. If ρ is any relation on A satisfying II.2,
II.3 and II.4 and the further condition :
a ρ b and b ∈ K entail : a ∈ K (for short "K is
upper saturated for ρ"), then : a ρ b entails :
a > b (K).
(First, a > b (K) and b ∈ K entail :
a ∈ K : it is enough to make x = y = ∅ in II.1.
Now, if ρ is as described, a ρ b implies :
xay ρ xby and, by the supplementary condition,
whenever xby ∈ K, then : xay ∈ K, so that a ρ b
entails a > b (K)).


II.6. We define now :
a ≡ b (K), if and only if, at the same time,
a > b (K) and b > a (K).


II.7. In view of II.3, 4, 5, 6 :

(II.2)' a ≡ a (K)

(II.2)'' a ≡ b (K) implies : b ≡ a (K)

(II.3)' a ≡ b (K) and b ≡ c (K) imply : a ≡ c (K)

(II.4)' a ≡ b (K) and c ≡ d (K) imply : ac ≡ bd (K)

(II.5)' a ≡ b (K) and a ∈ K imply : b ∈ K

and ≡ (K) is maximal with these properties.


II.8.


Not every congruence on A is surely a
syntactic equivalence but at least we can state :
If ρ is any congruence on A with classes K1,
K2, ..., Kn, then a ρ b is equivalent to :
a ≡ b (K1) and a ≡ b (K2) and ... a ≡ b (Kn).
(By the very definition of ρ, a ρ b entails
xay ρ xby for all x, y ∈ A, that is to say :
xay ∈ Ki ⟺ xby ∈ Ki.

Conversely suppose a and b not in relation
ρ ; then a ∈ Ki and b ∈ Ki', with i ≠ i',
and accordingly a ≢ b (Ki).)


The problem of determining in a useful way
just how many and which of the Ki's are enough
for reconstructing in this way a given ρ is
still open.
The meaning of ≡ will be clearer if we consider
the special case where A is a finite group.


If A is a finite group, a > b (K) implies
b > a (K) and is equivalent to : a⁻¹b ∈ G, where
G = ∩ x⁻¹Kx is the largest normal subgroup
contained in K. Thus, a ≡ b (K) is equivalent
to a ≡ b (G).
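
The group case can be checked by brute force on a small example. The sketch below computes the classes of ≡ (K) from definition II.1 in the additive group of integers modulo 6, with K = {0, 3}. The choice of group, of K and the function name are assumptions made for illustration only.

```python
def syntactic_classes(elements, product, K):
    """Classes of a ≡ b (K): a and b are completed into K by the same pairs (x, y)."""
    def contexts(a):
        return frozenset((x, y) for x in elements for y in elements
                         if product(product(x, a), y) in K)

    classes = {}
    for a in elements:
        classes.setdefault(contexts(a), []).append(a)
    return sorted(classes.values())


def add_mod_6(x, y):
    return (x + y) % 6


# K = {0, 3} is a (normal) subgroup, so the classes should be its cosets.
classes = syntactic_classes(range(6), add_mod_6, {0, 3})
print(classes)
```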


(If A is a finite group KucK entails Ku = 
K, whatever be uCA. Further, if f is any 
congruence, there is a normal subgroup G of A 
such that all saturated K's (i.e. all subset 


to 3 


of A satisfying : aeK' and apb entails bcK') 


may be put under the form : K* = U RG, , 


ale 
Now II.2 may be written : Rye vin ce *, forall! 


x, y€A so that ab (K) entails a=b (K). Owing 

to II.5, there is (— » normal, with K = U Koy 

From the above epee form of II.2, it EAS in 

partioular that! & BG (> implica gate e(R)ivand, by 

“the maximal property of :(K), that @ = flea K x). 
XE 


Fundamental semi groups.
Call φ the semi group homomorphism associated
to ≡ (K).
II.9. a > b (K) in A is equivalent to :
φa > φb (φK) in φA.
(On one hand : xby ∈ K implies φxφbφy = φ(xby)
∈ φK. On the other hand, since K is upper
saturated for > (K), φxφbφy ∈ φK entails
xby ∈ K).
II.10. In φA, φa ≡ φb (φK), if and only if :
φa = φb. (Since by II.9 φa ≡ φb (φK)
implies a ≡ b (K), i.e. φa = φb in φA).
So, once applied φ to the original A,
iteration of the process in respect of φK does
not give any further non trivial homomorphism,
or, as one could say : φK is syntactically
simple in φA.
We recall that if ψ is any homomorphism and ρ
the congruence defined by : x ρ y if and
only if ψx = ψy, the fact that a complex U
be saturated for ρ is equivalent to : ψ⁻¹ψU ⊂ U.


This again is emg erent to : 
For all x: y (0 POEs vo yx). 
(ave tt = oes ax £U, which entails ya yx 
S yu i.e. yar WU ( 7e.9 jst 3 conversely, 
Ya px = YacpyQ entails ax = q€Q; this 
is the very definition of saturation). 


The following result allows the building of 
all semi groups ( AK) such as (y ADwK ) be 
isomorphic with a given ry > K where K is 
syntactically simple in re 


II.11. Be given A>K and an homomorphism } . 
eSeSeVe = % 

sa anaes TOL Ape kN 
yykc K. (Necessary : if aca-K, 
k<«K and Yk = Ue. = ké WK; then : 


(Pecvealpyev)e mnt fad tyke 




Sufficient : if azb — entails acb, then it 
entails: "ya =yb (*pK). A 


Now, we prove that preserves features important 
for the translation theory : 


IT.12. let Q=K> & be a sub semi group from A. 


Any one of the following properties is true in the 
same time in A and 2 Hons 


Ne he z 
e0n dG ca; Qn he OAY 
(obvious due to the saturation peopenty and the 


fact XY‘ is just a notation for UY Xy! De 
seY 


Prefixes. 


Definition : we call right prefixes the
equivalence classes for the relation :


a ~ b, if and only if : a⁻¹K = b⁻¹K. (~ is the
"principal equivalence" of P. Dubreil)


II.13. a ~ b entails : ax ~ bx for all x and :
a ≡ b (K) is equivalent to : ya ~ yb for all y ∈ A.


(Classical : au ∈ K ⟺ bu ∈ K for all u, entails in
particular :


axv ∈ K ⟺ bxv ∈ K for all v, by specialising
u = xv.

Now, a=b (K) could just as well be written : 


or «( - 
(ae) K = (bx) K for all x, or : K (ya) = K (yb) 
for all y)e 


As an immediate consequence :


II.14. The representation of the x ∈ A as
applications of the set Π of the right prefixes
into itself is an isomorphic representation of
φA.

(a, b ∈ Πi entails ax, bx ∈ Πi x. On the other
hand Πi x ~ Πi y for all Πi ∈ Π entails x ≡ y
(K) and reciprocally.)


Of course "left prefixes" do exist just as well
and play some role in the theory. In general, their
set is quite different from Π and, accordingly,
the two matrix representations of φA as
translations on the right or on the left, although
isomorphic, are not in general equivalent. In
connection with these points it may be useful, at
times, to visualise in a somewhat different way
the notion of prefix :


Let K be the relation between subsets of A 
defined by XxY= KXY¥cK 


II.I5. The ae XK (x° Pe X “and 


x3) (ex" ee ae *X are the Galois closures 
corresponding to ‘<and the closed sets of the 
type X” (or *X) are the right (left) prefixes. 


(Indeed, by definition ,the right prefix ;” contain- 
ing x is the set of all x' such as whenever 

xy <K , then x'y «K too; if making X = x, we 
obtain exactly rt;= X*. The link with "Galois 
closure" concepts is straight forward (see for 
instance : Birkhoff; Lattice Theory, chap. IV) ). 


This remark leads very easily to the following 
useful proposition : 


II.16. Let A be a free semi group generated by a
finite set of letters and K = M be a sub
semi group of A generated by M0.


If M is isomorphic to a free semi group and if
every word in the generating set M0 has a length
bounded by L < ∞, then φA is finite. (Owing to
Theorem I.1, ax ∈ M implies that the first critical
indice of ax at the right of the end of a be
separated from this point by less than L letters.


Further ax ∈ M implies that axm ∈ M for all m ∈ M.
So, the set of all x such as ax ∈ M for a given
a has the form M'M where M' is a subset from the
bounded set of all right divisors of the elements
in M0. It follows that there is only a finite
number n of right prefixes and from II.15 a finite
number, too, < 2ⁿ of left prefixes. The proposition
is then a direct consequence of II.14.)


Finally if, dropping the boundedness condition for
M0, we add that M be unitary on the left, we
obtain a partial result which will be used in the
next appendix.


II.I7. let S denote the transitivity closure of the 
relation ~c defined by : a <.b whenever exist 
m, m’<¢ MUM and céA with : a=me and b=m'c. 


Under the above specified conditions : ax b entails 
arvb and each ~*-class jj: contains a single 
well defined element c;admitting no mc¢M as left 
divisor. 


(By the definition, a <:,ma so that one may can- 
cel as many as initial me«M as to obtaina c 
which cannot be left simplified in this way. 
Obviously no two such c and c' can be in 
relation ~,. 


If one had c 2-c', this would mean that for a 
family {a ;,{ there exists the relations : 


(e oO ay A) At 89 > 2eoee > a, Me cus 

But c an, By and By MLA imply : 

a, = me and a, = m)c) ype = 1,C, 

with mcM and c, lef¢- simplified as above. 


Owing to the fact that M is isomorphic to a free 
semi group, this implies : 


cr and m) =m i.@. : C M,8 ° 


Finally, by the unitary character of M, ax = 
mex€M entails : cxéM and bx = m'cx€M 


isew : and). 

Actually, a ~ b means exactly that, apart for an
initial complete message, a and b correspond to
the same node on the coding tree. A broader
generalisation of the notion of syntactic
equivalence is sketched in App. V.


Appendix III. Super coding : 
The results in this section are not optimal. 


semi group B. Let L be U.N. in B. This isomorphism 
sends L onto a subset 0L=NcA and we have : 


III.1. N is aU.N. subgroup of A contained in M. 


(That WcWesM is obvious. Owing to the iso- 
morphism 0 from LcB to NcM, if axéN then 


“i 4 
© (ax)¢L and if, further, acN y then Oa¢L 


so that L unitary in B (ice. OxeL) implies 

N unitary in A. In the same way N is neat in A 

if and only if LCB and MCA are both neath, 
III.2. Any class nae for ~ (N) may be represented as 

the logical product of a ~~ (L) class, ty » and 

a 72(M) class, ipl! 5 


(let ‘7. have c for minimal representative - 
as defined in II.I6.~ There is a unique de- 
composition of c = mm, ... me’ with mje M 


and c' : a minimal representative of aly ¢ On 
the other hand ;: m = MM, oo. M, is, too, a mini- 


mal representative of oie owing to the iso- 
phormism. ) 


III.3. If a ≡ b (N), then : a ≡ b (M)


(In view of II.14 and II.16 it suffices to prove
that a ~ b (N) implies a ~ b (M). Let c =
m1 m2 ... mk c' be the minimal representative for
the common ~ (N) class of a and b. Any
d of this class has the form d = mc with m ∈ M so
that d⁻¹M depends only on c). It is an
immediate consequence of III.3 that if φN and
φM are the syntactic homomorphisms attached to
N and M :


III.4. φM A is an homomorphic image of φN A.


MH 

It must be observed that this is not generally
true when N is any sub semi group contained in
M, in contrast with what happens when A is a
group. The only perfectly general result is the
far less interesting :

III.5. If K and K' are two subsets of A :
a > b (K) and a > b (K') entails : a > b (K ∩ K')
(xby ∈ K ∩ K' entails xby ∈ K and xby ∈ K'.
The last relations entail xay ∈ K and xay ∈ K',
i.e. xay ∈ K ∩ K').
For filtering, one needs the very simple : 


IIT.6. M being U.N. in A, there is for any pair 
of bounded Ve Cok at least one bounded N M 


For the sake of simplicity we state the propositions
under a form which is valid in the case of neat
unitary coding. Some of them would have to be
radically altered in the broader set-up alluded to
at the end of App. II.


A is any semi group and K=M a sub semi group of 
A, unitary on the left, neat on the right (U.N.). 


Thus - (from II.1) - M is isomorphic by 6 to a free 




which is U.N.°“in A and such as: x Ay (N). 


(The only case of interest is x=y (M) :- 


= ’ e t i] ' Re 
let X=M M -- mic z mi, mM, oem, Clay 


(m; © M) and cdeéM. Take for NcCM any 
U.N. code admitting x d¢M. and not y deN, ). 


Appendix IV. The main ergodic theorem. 
¢1) Let A > Q be any two semi groups with 
“Wg GNVQ QO bG-ac in. 11.3. 


Definition: 


A will be said absorbing for Q if,
whatever a, b ∈ A, there is at least a bounded
u, depending eventually on a and b,
such as : au ~ bu (Q).


Observe first that the problem is only
interesting if Q is neat on the right in
A. If not, there is indeed a trivial prefix
Π0 into which any x is sure to fall sooner
or later ; this being characterized by :
x ∈ Π0 entails : for no u, xu ∈ Q.
Thus we are led to include the condition
that Q be neat on the right in A. It will
be very convenient to restrict even the problem
to the case where :
1) A has a finite basis A0 ;
2) Q is strongly neat, i.e. any bounded x ∈
A admits at least one bounded right residual
u ∈ x⁻¹Q.


By having read "backwards" the current
paper, we know that all this implies that Q
be unitary on the left. If not we look for a
proof at appendix VI.


IV.1. A n.a.s.c. for A to be absorbing is
that for all x ∈ A there exists at least one
bounded q ∈ Q with xq ∈ Q (i.e. q ∈ x⁻¹Q).
(Sufficient : Let x y€Q and y q'E Q; 

Q being unitary : x qwy q'. 1 

Necessary : Let x€Q and then v€u Q then: 

Q>x u vwy uv and u vEQ.) 


IV.2. If the conditions of IV.1 are fulfilled,
there is for any finite set X = {x1, x2, ..., xk}
a common bounded absorbing u
(i.e. satisfying : xi u ∈ Q for all xi ∈ X).
(Take first : q1 ∈ x1⁻¹Q ∩ Q (≠ ∅ by hypothesis),
then : q2 ∈ (x2 q1)⁻¹Q ∩ Q, etc.
u = q1 q2 ... qk.)


IV.3. A sufficient condition for A not to be
absorbing is that Q pertains to one of the
three following families :
1) Uniform codes.
2) Uniformly composed codes.
3) Anagrammatic codes (i.e. with Q unitary on
the right)
or to any coding obtained from these by
U.N. supercoding.

((1) Q is the set of all sequences whose
length is a multiple of k ≠ 1 so that x ∉ Q
entails : xq ∉ Q for all q ∈ Q.

(2) Q is the set of all sequences made of
k ≠ 1 words from P, where P ⊃ Q is U.N. If x ∈ P
- Q, so is it of every xq with q ∈ Q.
(3) Q unitary on the right implies that :
xq ∈ Q and q ∈ Q only if x ∈ Q).


IV.4. If Q admits a bounded basis the IV.3
condition is necessary. (Proof involves the full
theory of Suschkevitsch's Kern Gruppe plus some
appendices. However, let it be sketched roughly.)


A =A » finite, admits a minimal sub demi group 

such as xQ@y CG for all x, y@A. is the 
minimal bilateral idealof A. @ can be built out 
of a group @ by some, not too immediate, method. 
(Suschkevitsch, Math. Ann. 99, 1928, p. 30-50.
Cf. too: Clifford, Am. J. Math. 71, 1949, p. 835.)
The condition for A to be absorbing is : 

a) That $ be reduced to a single element €. 

2) That Gbe the unique minimal left ideal 
of A . (i.e. G = D where D is the minimal set 
with xDCD for all x). 


The way the elements of A operate on those of d 
can be represented by matrices (Schutzenberger, 
C. R. Acad. Sci. 18/V1/1956) in an isomorphic 
way when A is syntactically simple. Provided 
that Q is not unitary on the right ( family
(3) ), if II is violated, there is a first homo- 
morphism, which reduces A to AY satisfying it. 


If I is still violated, there exists again a 


second homomorphism sending into A satisfy- 
ing I and II. The inverse image of for the 


first homomorphism is PDQ which is still uni- 
tary on the left ( family 2) unless we be dealing 
with a member of family 1. ) 


Appendix V. Introduction of probabilities.


Let A and Λ be two d.s.g.l. with generating
sets (or "alphabets") A_0 and Λ_0. Let
there be given a probability distribution P on Λ_0:
P = Pr(λ_i) for λ_i ∈ Λ_0, and suppose that a source
generates sequences of λ's, independently
and according to P. A coding machine transforms
each of the successive λ_i's into a string
of letters q_i = θλ_i in A, and a receiver observes
this sequence continuously at the end of the
communication set-up. Whatever be the subset Q_0
= {q_i} (generating a sub semi group Q ⊂ A),
we call the stochastic process observed by the
receiver the stochastic discrete semi group language
(θ, P).


In what follows it will be systematically
assumed that the transmission has started at a
time finitely remote in the past and that the receiver
has followed the process from its very
first letter. In fact, our discussion deals only
with what could be called a "Prefix theory" and
not with a full syntactic theory, which we
have no space to develop here. So the notation
Pr(ax | a) will denote the probability for
the receiver, having observed the initial string of
letters "a", to see it later followed by "x".
However, in view of their simplicity and general
interest, we introduce first some concepts
broader than is strictly necessary here.


Definition: (a, b) = (b, a) = sup |log Pr(ax | a) - log Pr(bx | b)|,
the upper limit being taken over x ∈ A.

V.0.1. (a, a) = 0; (a, c) ≤ (a, b) + (b, c).

V.0.2. Whatever be c: (ac, bc) ≤ 2(a, b),
since, if Pr(ac | a) ≠ 0,
Pr(acx | ac) = Pr(acx | a) / Pr(ac | a).


Definition: The classes on A for the right regular
relation (a, b) = 0 will be called
the stochastic prefixes.

(The relation is right regular since (a, b) = 0
entails (ax, bx) = 0 for all x ∈ A.)
The stochastic prefix to which any a pertains
may be considered as a sufficient statistic, or
"résumé exhaustif" in the sense of G. Darmois, of
the past of the message, and does not need for its
definition that of a s.d.s.g.l. Stronger relations
should be considered for such problems
as Wiener's time series prediction.


V.1. A n.a.s.c. for (q, q') = 0 for all q, q' ∈ Q
is that Q be unitary on the left and that θ be a
coding.

(fo each a we associate the set a AMQ of its 
right multiples wich are in the time members of 


Q. 


We say that qCa AN is minimal if for all de- 
composition of q undér the form a= 4! 9" (9 EQ) 


a=q' a' with a' # gd. 


Let the set of all such minimal aca Af\qQ, 
be called the right envelope of d. By defini- 
fion : Pri( a’) => Pr (dh) 

Oe. ACO ES 
( @ is eventually a one~toe many operation in 


our perfectly general set up). If, and only if, 
Q is unitary on the left, Eg reduces to 4 it- 
self when g€ Q. In this case we have Pr (q c | q) 
= Pr ( c ) whatever be cE A , so that Pr(qclq) 
BRET sd eCG ls juieee: 
(oc. q)' J. when q , av é—Q; 


If Q is not unitary or if 8 does not give a coding 


for at least a counle g' = qr with dq, g'€Q, 
rec, fr Carla yf Pr Cr ) and ( q , 9') fo). 


If comparing with chapter 12 of Feller's Probability
Theory, one sees that the case of coding with
Q unitary on the left coincides rigorously
with the assumption that the scansion symbol "/"
is a recurrent event. This identity was first
observed, and its consequences developed, by B.
Mandelbrot (Cf. C.G. Proc. Symp. Inf. Networks,
1954, p. 203), allowing by this connection an
enlargement of the theory towards the continuous
case, which we have systematically avoided until
now.

It will be observed that, in the unitary case, stochastic
prefixes correspond to nodes on the coding
tree and that what we did with our definition was
just reducing the stochastic process under consideration
to the canonical form of a Markoff
chain. It must be underlined that the number of
classes is in general infinite when Q is
not unitary on the left, even if all words are
length bounded.


Appendix VI. Admissibility of U.N. coding.


In what follows we assume that the generating
set A_0 of A contains a finite number k
of elements.

As in App. V, we consider a d.s.g.l. Λ generated
by Λ_0 and any homomorphic application θ of Λ
onto a sub semi group Q ⊂ A; |a|, where a ∈ A,
denotes its length.


Definitions: N(ℓ) = Σ δ(ℓ, |θλ|), summed over λ ∈ Λ, where
δ(ℓ, |θλ|) = 1 if ℓ = |θλ|,
           = 0 if ℓ ≠ |θλ|.

(N(ℓ) is the number of sequences in A of
length ℓ which pertain to Q, each being weighted
according to the number of its Q decompositions.)

h(t) = 1 - Σ t^{|θλ_i|}, summed over λ_i ∈ Λ_0.

h(t) is the "structure function" of B.
Mandelbrot.
We recall first the well known result : 


VI.1. The generating function Σ N(ℓ) t^ℓ of the
N(ℓ) is 1/h(t), and N(ℓ) may be expressed by:

N(ℓ) = Σ A_j(ℓ) ρ_j^{-ℓ}, summed over the ρ_j,

where {ρ_j} is the set of the zeroes of h(t)
and A_j(ℓ) is a polynomial which reduces to a constant
A_j if and only if ρ_j is simple.

In view of its importance we single out ρ, the
zero of smallest modulus of h(t). Owing to
the expression of h(t): 0 < ρ < ∞ and ρ is a
simple root.
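As a numerical illustration of VI.1 (a sketch, not part of the original argument), assume that the structure function has the form h(t) = 1 - Σ t^{l_i}, where the l_i are the lengths of the generating words of Q. Comparing coefficients in N(t) h(t) = 1 then gives the recurrence N(0) = 1, N(ℓ) = Σ N(ℓ - l_i). The function name and the example lengths are illustrative assumptions.

```python
def weighted_counts(word_lengths, max_len):
    """Coefficients N(0), ..., N(max_len) of 1/h(t), with h(t) = 1 - sum(t**l).

    N(n) counts the sequences of length n that decompose into words of the
    given lengths, each weighted by its number of decompositions.
    """
    counts = [0] * (max_len + 1)
    counts[0] = 1  # the empty sequence has exactly one decomposition
    for n in range(1, max_len + 1):
        # The last word of a decomposition has some length l <= n.
        counts[n] = sum(counts[n - l] for l in word_lengths if l <= n)
    return counts


# Two words, of lengths 1 and 2: the counts follow the Fibonacci recurrence.
fib_like = weighted_counts([1, 2], 5)
# Two words of length 1 (a full binary alphabet): N(n) = 2**n.
full_binary = weighted_counts([1, 1], 4)
```

With two words of length 1 the counts grow as 2^ℓ, which is the extreme case ρ = k^{-1} for k = 2.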


Wola IN pecessary condition for_,to give a cod— 
ing (ie, oanad' Q) is that kK 1< ¢ 

(If not, for large enough ℓ, N(ℓ) > k^ℓ, so that at
least one q ∈ Q = θΛ corresponds to two different
λ ∈ Λ.)


VI.2. A necessary condition for Q to be strongly
neat on the right is that ρ ≤ k^{-1}.

(If Q is strongly neat, to every one of the k^ℓ
sequences of length ℓ in A corresponds at
least one sequence in Q of length ℓ' ≤ ℓ + L,
where L < ∞. So for large enough ℓ:
N(ℓ) + N(ℓ+1) + ... + N(ℓ+L) ≥ k^ℓ.)


VI.3. To any structure function h(t) with
ρ ≥ k^{-1} corresponds at least one Q which is
unitary on the left.

(h(k^{-1}) ≥ 0 (Szilard's inequality). If the
equality sign does not hold, develop h(k^{-1})
in ascending powers of k^{-1} and consider it as
minus the value for t = k^{-1} of a new function
h'(t). Then h(t) + h'(t) is again a structure
function for which ρ = k^{-1}.
Write h(t) + h'(t) = (1 - kt)(1 + n_1 t + n_2 t^2 + ...).

The n_i satisfy 0 ≤ n_i ≤ k n_{i-1}. It is easy to
check that these inequalities imply that the n_i
may be the numbers of non-terminal nodes at
length i from the apex of the coding tree corresponding
to Q unitary and neat. The tree associated
with the original h(t) is easily recovered
by pruning from it the nodes and words
introduced by h'(t). Q is still unitary but no
more neat.)
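The tree construction behind VI.3 can be illustrated as follows. This is a sketch under two assumptions: that h(t) = 1 - Σ t^{l_i} for word lengths l_i, so that ρ ≥ k^{-1} amounts to Σ k^{-l_i} ≤ 1, and that "unitary on the left" corresponds, for the generating words, to no word being the beginning of another. Taking the lengths in increasing order and giving each word the next free node of the k-ary tree yields such a set. The function name is hypothetical, and digits are written as single characters, so k ≤ 10 is assumed.

```python
from fractions import Fraction


def prefix_code_from_lengths(lengths, k=2):
    """Build words over the digits 0..k-1 such that no word begins another."""
    if sum(Fraction(1, k ** l) for l in lengths) > 1:
        raise ValueError("lengths violate sum(k**-l) <= 1")
    words = []
    value, prev_len = 0, 0
    for l in sorted(lengths):
        value *= k ** (l - prev_len)  # descend to depth l in the tree
        digits, v = [], value
        for _ in range(l):            # write value in base k, padded to l digits
            digits.append(str(v % k))
            v //= k
        words.append("".join(reversed(digits)))
        value += 1                    # next free node at this depth
        prev_len = l
    return words


binary_words = prefix_code_from_lengths([1, 2, 3, 3])
ternary_words = prefix_code_from_lengths([2, 1, 2], k=3)
```

When the inequality is strict, some nodes of the tree remain unused, which matches the remark that Q is then unitary but no longer neat.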


Consequences: 


VI.5. The class of the U.N. codes is a complete
admissible class and the structure function of
an admissible code admits the root k^{-1}. (The
construction of a P for which a U.N. code is
optimal is well known. On the other hand, if
ρ ≠ k^{-1} - i.e. if the tree does not correspond
to a neat Q - it is easily seen how
one at least of the words may be replaced by a
shorter one.)


Remark : A parallel line of argument could have
been based on the function: 10 Ail 
h*(t ) = 1-2 Pr. (> Noe ae 


deriving from the determinant of the transition 
matrix in the underlying Markoff process. Along
this line, admissibility is proved (B. Mandel- 
brot ) by using the Hartley - Shannon's informa- 
CHOON mere LOCMEDIN 

i b t 


V1.6. A n.a.s.c. for Q to be admissible is that 
HOT eMOMme mis Peay: Q for all length bounded 
Ae Viera 

(Necessity is obvious since the existence of such 


an a would make P> Kt Sufficiency derives 
directly from the same argument as in Wilas)) 


VI37. If © “contains, la" asi anal. omonesmay, 
add to Q at least a word of the form x a
(or: a x) without destroying the coding
character of Q.

(Proof is combinatorial and is based on the refinement
of Sardinas-Patterson's algorithm.)
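For orientation, the classical Sardinas-Patterson test itself (not the refinement referred to above, which is not given here) can be sketched as follows. It decides whether a finite set of non-empty words has the coding character, i.e. whether every sequence has at most one decomposition into words of the set. The function names are illustrative assumptions.

```python
def left_quotient(prefixes, words):
    """All non-empty w such that p + w is in `words` for some p in `prefixes`."""
    return {
        w[len(p):]
        for p in prefixes
        for w in words
        if w.startswith(p) and len(w) > len(p)
    }


def is_uniquely_decodable(code):
    """Classical Sardinas-Patterson test for a finite set of non-empty words."""
    code = set(code)
    dangling = left_quotient(code, code)  # suffixes left when one word begins another
    seen = set()
    while dangling:
        if dangling & code:
            return False                  # a dangling suffix is itself a word
        frozen = frozenset(dangling)
        if frozen in seen:
            return True                   # the sets repeat without ever meeting the code
        seen.add(frozen)
        dangling = left_quotient(code, dangling) | left_quotient(dangling, code)
    return True


ambiguous = is_uniquely_decodable({"0", "01", "10"})   # "010" = 0.10 = 01.0
suffix_code = is_uniquely_decodable({"0", "01", "11"})
```

The loop terminates because every dangling suffix is a suffix of some word of the set, so only finitely many distinct sets can occur.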

ss () (-!) 

VI.8 If Q satisfies QCQ QMQ Q and is ad- 
missible, a n.a.s.c. for it to be strongly neat 
on the right is that it be unitary on the left. 


(The initial condition imposes f= Ko sitto 
neat on the right, has % for generating set and 
alae ase Q, is the set of all words which are not 
prover right multiple of another word, then Q! 
generated by Q' is still neat on the right and is 
unitary on the left by definition. Since h(t ) 
and h' (t) admit both the mimimal root f = K™ 
they are identical and Q) = Qo If Q is uni- 
tary on the left, Q" the minimal U.N. code con- 
taining Q is, again for the same reason, iden- 
tical with Q). 


THE LOGIC THEORY MACHINE: A COMPLEX INFORMATION PROCESSING SYSTEM


Allen Newell and Herbert A. Simon¹
The RAND Corporation, Santa Monica, Calif. 
and the Carnegie Institute of Technology, 
Pittsburgh, Pa. 


Abstract 


In this paper we describe a complex in- 
formation processing system, which we call the 
logic theory machine, that is capable of dis-
covering proofs for theorems in symbolic logic. 
This system, in contrast to the systematic al- 
gorithms that are ordinarily employed in compu- 
tation, relies heavily on heuristic methods 
similar to those that have been observed in 
human problem solving activity. The specifi- 
cation is written in a formal language, of the 
nature of a pseudo-code, that is suitable for
coding for digital computers. However, the 
present paper is concerned exclusively with 
specification of the system, and not with its 
realization in a computer. 

The logic theory machine is part of a program
of research to understand complex information
processing systems by specifying and synthesizing
a substantial variety of such systems
for empirical study.


Introduction 


In this paper we shall report some re- 
sults of a research program directed toward the 
analysis and understanding of complex informa- 
tion processing systems. The concept of an in- 
formation processing system is already fairly 
clear and will be made precise in Section I, be- 
low. The term 'complex' is not so easily disposed
of; but it is the crucial distinguishing
characteristic of the class of systems with 
which we are concerned. 

We may identify certain characteristics of 
a system that make it complex: 

1. There is a large number of different 
kinds of processes, all of which are important, 
although not necessarily essential, to the per- 
formance of the total system; 

2. The uses of the processes are not fixed 
and invariable, but are highly contingent upon 
the outcomes of previous processes and on informa- 
tion received from the environment; 

3. The same processes are used in many 
different contexts to accomplish similar func- 
tions towards different ends, and this often re- 
sults in organizations of processes that are 
hierarchical, iterative, and recursive in nature. 

Complexity is to be distinguished sharply 
from amount of processing. Most current computing
programs for high speed digital computers
would not be classified as complex according to 


¹The authors are indebted to Mr. J. C. Shaw of
the RAND Corporation, who has been their partner
in many aspects of this enterprise and
particularly in undertaking to realize the logic
theorist in a computer--work that will be
reported in subsequent papers.




the above criteria, even though they may involve 
a vast amount of processing. In general they 
call for the systematic use of a small number of 
relatively simple subroutines that are only 
slightly dependent on conditions. In order to 
distinguish such systematic computational processes
from the processes we regard as complex, we
shall call the former algorithms, the latter
heuristic methods. The appropriateness of these
terms will become clearer as we proceed. 

One tactic for exploring the domain of com- 
plex systems is to synthesize some and study 
their structure and behavior empirically. This 
paper provides an explicit specification for a 
particular complex information processing system 
--a system that is capable of discovering proofs 
for theorems in elementary symbolic logic. We 
will call the system the logic theorist (LT), 
and the language in which it is specified the
logic language (LL). This system is of interest
for a number of reasons. First, it satisfies 
the criteria of complexity we have listed above. 
Second, it is not so large but that it can be 
hand simulated (barely). Third, the tasks it 
can perform are well-known human problem solving 
tasks--it is a genuine problem solving system. 
Fourth, there are available algorithms, and a 
realization of at least one of these algorithms 
(the Kalin-Burkhart machine),² that can perform
these same tasks; hence, the logic theorist pro- 
vides a contrast between algorithmic and heuris- 
tic approaches in performing the same problem 
solving tasks.

The task of this paper, then, is to specify 
LT with sufficient rigor to establish precisely 
the complete set of processes involved and ex- 
actly how they interact. This is a lengthy and 
somewhat arduous undertaking but one that the 
authors feel is required in the present state of 
knowledge. As a result, the paper largely abstains
both from comment on the more general sig-
nificance of the ideas and techniques introduced, 
and from relating these to contemporary work. 

The plan of the paper is to give, in Section 
I, a description of the language, LL, in which LT 


²See B. V. Bowden.¹


We should like to make general acknowledgment of
our indebtedness for many of the ideas incorporated
in LL and LT to two areas of vigorous contemporary
research activity: (1) to research on
automatic programming of digital computers, for
the approach to the construction of LL; and (2)
to research on human problem solving, for the
basic structure of the program of LT. In addition
we should like to record a specific indebtedness
to the work of O. G. Selfridge and G. P. Dinneen
on pattern recognition, which clarified many basic
conceptual issues in the specification and realization
of complex information processes.


will be specified. In Section II there is given 
a verbal description of LT, which is closely e- 
nough tied in to the formal program to motivate 
most of the latter. Finally, in Section III, the 
program is given in full detail.


I 


Language for Information Processing Systems 


The two major technical problems that have
to be solved in studying information processing 
systems by means of synthesis may be called the 
specification problem and the realization problem. 
To study all but the simplest of such systems, it 
is necessary to make a complete and precise state- 
ment of their characteristics. This statement, 
or specification, must be sufficiently complete 
to determine the behavior of the system once the 
initial and boundary conditions are given. An 
example, familiar to mathematicians, is a system 
specified by n first order differential equations 
in n variables.

Once the specification has been given, a 
second problem is to find or construct a physical 
system that will behave in the manner specified. 
This can be a trivial or an insurmountable task. 
For example, it is relatively easy to find elec- 
trical circuitry that will behave like a system 
of linear differential equations; it is rather 
difficult to represent by circuitry most kinds of 
nonlinear systems. We will call the problem of 
finding or constructing the physical system the 
realization problem, and the particular physical 
system that is used the realization. 

Although this paper is concerned exclusively
with the specification problem, the form of lan- 
guage chosen is dictated also by the requirements 
of realization. Since an important technique for 
studying the behavior of complex systems is to 
realize them and to study their time paths em- 
pirically under a range of initial and boundary 
conditions, they must be specified in terms that 
make this realization relatively easy. 

The high speed digital computer is a phys- 
ical system that can realize almost any informa- 
tion processing system and our research is ori- 
ented toward using it. Its limitations are in 
overall speed and memory, rather than in the 
complexity of the processes it can realize. The 
machine code of the computer is the language in 
which a system must ultimately be specified if it 
is to be realized by a computer. Conversely, however,
once the system is correctly specified in
machine code, the realization problem is essen- 
tially solved; for the computer can accept these 
specifications, and will behave like the system 
specified. 

The machine code, although suitable for 
communicating with the computer, is not at all 
suitable for human thinking or communication 


⁴We prefer "realization" to "simulation," for the
latter implies that what is being imitated is another
physical system. Since the specification
is an abstract set of characteristics, not a physical
system, it is not correct to speak of "simulating"
the specification.




about complex systems. For these purposes, we
need a language that is more comprehensible (to 
humans), but one that can still be interpreted by 
the computer by means of a suitable program. 
Technically, such a language is known as a pseudo- 
code or interpretive language. Hence the two 
problems of specification and realization of an 
information processing system are subsumed under 
the single task of describing the system in an 
appropriate pseudo-code.

This paper is concerned solely with specifying
the system of LT. The particular language,
LL, exhibited here has not been coded for a computer.
However, one very similar to it, which is
less convenient for exposition, is in the process 
of being coded and will be the subject of later 
papers. Here, no further mention will be made of 
the relation of the logic language to computers. 

The terms of the language that are undefined
--its primitives--determine implicitly a set of
information processes that are to be regarded as 
elementary and not reducible, within the language, 
to simpler processes. The more complex processes
are to be specified by suitable combinations of 
these elementary processes. Generally speaking, 
the elementary processes in LL are of the nature 
of information processes: that is, their inputs 
and outputs are comprised of symbolized informa- 
tion. 


Information Processing Systems: 
Basic Terms 


An information processing system, IPS, con- 
sists of a set of memories and a set of informa- 
tion processes, IP's. The memories form the in- 
puts and outputs for the information processes. 

A memory is a place that holds information over 
time in the form of symbols. The symbols function
as information entirely by virtue of their
capacity for making the IP's act differentially. 
The IP's are, mathematically speaking, functions
from the input memories and their contents to the 
symbols in the output memories. The set of 
elementary IP's is defined explicitly, and through 
these definitions all relevant characteristics 

of symbols and memories are specified. 

Particular systems can be constructed from 
the memories and processes of an IPS that behave 
in a determinate way once the initial information 
in the memories is given (initial conditions), 
along with whatever external information is stored 
in the memories during the course of the system's 
operation (boundary conditions). Each such particular
system we call a program, IPP. Thus an
IPS defines a whole class of particular IPP's,
and conversely, an IPP consists of an IPS together
with a set of rules that determines when the
several information processes will occur. The
logic language is an IPS; the logic theorist is
an IPP. Many variations of LT could be constructed
with the same IPS.


Symbolic Logic 


The logic language handles information re- 
ferring to expressions in the sentential calculus 
and their properties. This paper assumes some 


familiarity with elementary symbolic logic,⁵ and
only a resume of the notation will be given.

The sentential calculus deals with variables,
p, q, ..., A, B, ..., a, b, ..., which are usually
interpreted to mean sentences. These variables
are combined into expressions by means of
connectives. The primitive connectives of Whitehead
and Russell (and ours) are "not" (-) and
"or" (v). In this paper we shall have occasion
to use only one other connective: "implies" (→),
which is defined by:


1.02   p → q  =def  -p v q     (Read: (p implies
                               q) is equivalent by
                               definition to (not-p
                               or q).)
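The definition can be checked mechanically. The following is a minimal sketch (the function and variable names are our own, not part of LL) that tabulates "implies" from the two primitive connectives:

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """(p implies q), defined as (not-p or q)."""
    return (not p) or q


# Truth table over all four assignments of p and q.
truth_table = {
    (p, q): implies(p, q) for p, q in product([True, False], repeat=2)
}
```

The only assignment for which the implication fails is p true and q false.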

Coding 


A logic expression, X, is represented in the 
IPS by a set of elements, E, one corresponding 
to each variable and to each connective (exclud- 
ing the punctuation dots and negation symbols) in 
the logic expression. Each element holds a 
number of symbols that refer to the various properties
of the element. (Note that the term
"element" and not the term "symbol" is used in
this paper to refer to the variables and connectives
in logic expressions; symbols denote properties
of elements, and to each element there
correspond a number of symbols.) An example
will show what is meant by these terms.⁶ Consider
the expression 1.7:


1.7   -p .→. q v -p     ((not-p) implies (q or not-p))


The entire sequence is the expression, X(1.7). 
It consists of the elements -p, →, q, v, -p. The
expression may be written in "tree" form, as 
follows, where the rectangles indicate the ele- 
ments: 


⁵For definiteness, we have used the system of A. N.
Whitehead and Bertrand Russell.² An introduction
sufficient for our purposes will be found in D.
Hilbert and W.Ackermann,* 


⁶For ease of reference, we shall use the numbers
employed by Whitehead and Russell to identify
particular propositions and definitions, only
omitting the asterisk (*) that they insert in
front of the number.


⁷We follow Whitehead and Russell in using dots
in place of parentheses as punctuation. It is unnecessary
here to give exact rules for numbers
of punctuation dots.


The main connective at the top is called the 
main element, EM (1.7). The other elements are 
reached through a series of Left and Right branch- 
es from the main element. With each element there 
is associated a subexpression, namely, the sub- 
tree of which that element is the top element. 

The symbols in each element provide the fol- 
lowing information, which will be explained more 
fully as we proceed. 


Symbol

G   The number of negation signs (-) before
    the expression. In the figure above, two
    elements--those containing the variable
    p--have G = 1; all the rest have G = 0.
    If a negation applies to a whole expression,
    it appears in the element associated
    with that expression.

V   Whether the element is a variable or not.

F   Whether the element is free, i.e., available
    for substitution. This is relevant
    only if E is a variable.

C   The connective (v or →). This is relevant
    only if E is not a variable.

N   The name of the variable or expression.
    In X(1.7), there are variables named "p"
    and "q".

P   The position of the element in the tree.
    This is represented by a sequence of L's
    and R's, counting branches from the main
    element. In the figure, the P for each
    element is shown beneath the element.

A   The location of the whole expression (not
    the element) in storage memory.

U   Whether the element is to be viewed as a
    unit or not. The term "unit" will be explained
    later.


The eight symbols defined above characterize 
completely each element and the expression in 
which it occurs. For many purposes, however, it 
is convenient to define additional symbols ("de- 
scriptive symbols") that correspond to interest- 
ing or important properties of expressions. In 
LL, three such descriptive symbols, represented
as small positive integers, are defined. These 
are: 


H   The number of variable places in an expression.
    Thus X(1.7) has three variable
    places: P = L, RL, and RR; hence,
    H(1.7) = 3.

J The number of distinct variables (i.e., 
distinct names) in the expression, ignor- 
ing negation signs. Since X(1.7) contains 
the names "p" and "q", J(1.7) = 2. 

K The number of levels in the expression. 
The number of levels corresponds to one 
plus the maximum number of letters in P 
for any element in the expression. Hence, 
K(1.7) = 3. 
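The coding of X(1.7) and the three descriptive symbols can be made concrete with a small sketch. The record layout below is an assumption for illustration only: it keeps the symbols G, V, C, N, and P, omits F, A, and U, and represents the position of the main element by the empty string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    P: str                   # position: sequence of L's and R's from the main element
    G: int = 0               # number of negation signs before the (sub)expression
    V: bool = False          # True if the element is a variable
    C: Optional[str] = None  # connective, relevant only for non-variables
    N: Optional[str] = None  # name of the variable


# Expression 1.7, (not-p) implies (q or not-p), as an unordered list of elements.
X_1_7 = [
    Element(P="RR", G=1, V=True, N="p"),
    Element(P="", C="implies"),
    Element(P="L", G=1, V=True, N="p"),
    Element(P="R", C="or"),
    Element(P="RL", V=True, N="q"),
]


def H(expr):
    """Number of variable places."""
    return sum(1 for e in expr if e.V)


def J(expr):
    """Number of distinct variable names, ignoring negation signs."""
    return len({e.N for e in expr if e.V})


def K(expr):
    """Number of levels: one plus the longest position string."""
    return 1 + max(len(e.P) for e in expr)
```

The list is deliberately given out of tree order: each element's P is enough to place it, so H, J, and K do not depend on the ordering.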


Memory Structure 
There are two kinds of memories, working memories
and storage memories. The major distinction
--that all information to be processed must be
brought in from the storage memories to the working
memories and then returned--will be brought
out clearly when we define the elementary IP's.
Structurally, the working memories hold single 
elements, E, with additional spaces for the sym- 
bols H, J, and K. Hence, we can picture a work- 
ing memory unit as: 


[Figure: a working memory unit, holding one element E with spaces for the symbols H, J, and K]


The storage memories consist of lists. A list 
holds either a whole logic expression or some set 
of elements generated during a process, such as a 
set of elements having certain properties. Each 
list of logic expressions has a location, symbol- 
ized by A. The elements are placed in the list 
in arbitrary order, since the information in each 
element is sufficient to locate it unequivocally
in the tree of the logic expression. (The order- 
ing of the list is used only to carry out search- 
es.) For example, X(1.7) might be listed in the 
storage memory thus: 


No limitations are imposed here on number of 
memories, either working or storage. In actual 
fact, the number used is not large. 

Three particular lists have special locations 
in storage memory that can be referred to di- 
rectly in IP's: (1) the theorem list, T, of all 
axioms and theorems that have previously been 
proved; (2) the active problem list, P; and (3) 
the inactive problem list, Q. Each list consists 
of the main elements of the appropriate expres- 
sions (theorems or problems, respectively) in 
arbitrary order. For the rest, the storage memory
is entirely unspecialized.


Information Processes 


A term that specifies an IP is called an in- 
struction, by analogy with computer terminology. 
As Figure 1 shows, an instruction consists 


[Fig. 1. Format of an instruction: OPERATION; REFERENCE PLACES (LEFT, CENTER, RIGHT); BRANCH LOCATION]


of an operation part, three reference places (left, 
L, center, C, and right, R), and a branch loca- 
tion, B. The kinds of operations that can be 
performed by an IPS will depend, first, on what 
elementary IP's are postulated, and second, on 
what restrictions are placed on how they can be 
combined. For the moment, the exact nature of the 
elementary processes is unimportant; for con- 
creteness, the reader may think of the following 
as typical: transferring information from memory 
x to memory y, or adding the number in memory x 
to the number in memory y. 

The reference places refer to the working 
memories, so that the same operation may operate 
on different memories at different times and 


under different circumstances. The working memories
will be designated by small integers, 1, 2,
..., and by the letters x, y, z.

No direct reference is made in an instruction
to any storage memory, except T, P, and Q. Lists
are located by the A stored within elements belonging
to the lists; and elements within a list
are located by their relation to known elements.
An example will make this clear. A typical operation
involving the storage memory is:


OPER   L   C   R   B
FR     x   y


which reads: Find the element that is the right 
subelement of E(x)--i.e., of the element in work- 
ing memory x--and put it in working memory y. 
The operation is executed thus: Working memory 
x contains the A(x) that is the location of the 
expression in which E(x) occurs. Memory x also 
contains the symbol P(x). Since we wish to put 
in y the right subelement of E(x), P(y) is by
definition obtained by appending an R to P(x).
Hence, we can determine P(y), and locate E(y) by
going to storage memory A(x) and searching the
list of its elements in order until we find the
element with the correct P. We then transfer 
this element, which is the one we want, to work- 
ing memory y. 
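The execution of FR just described can be sketched as follows. The data layout is assumed for illustration: storage memory is a mapping from a location A to a list of elements, each element is a dictionary of its symbols, and the position of the main element is the empty string. The behavior when no right subelement exists is not specified in this passage, so the returned flag is our own convention.

```python
def find_right(storage, working, x, y):
    """FR x y: put the right subelement of E(x) into working memory y."""
    element = working[x]
    target = element["P"] + "R"              # P(y) is P(x) with an R appended
    for candidate in storage[element["A"]]:  # search the list in order
        if candidate["P"] == target:
            working[y] = dict(candidate)     # transfer a copy to working memory
            return True
    return False                             # assumed convention: nothing found


# X(1.7) stored at a hypothetical location "A1", in arbitrary order.
storage = {
    "A1": [
        {"A": "A1", "P": "RR", "N": "p", "G": 1},
        {"A": "A1", "P": "", "C": "implies", "G": 0},
        {"A": "A1", "P": "R", "C": "or", "G": 0},
        {"A": "A1", "P": "L", "N": "p", "G": 1},
        {"A": "A1", "P": "RL", "N": "q", "G": 0},
    ]
}
working = {"x": dict(storage["A1"][1]), "y": None}  # x holds the main element
found = find_right(storage, working, "x", "y")
```

Because the search is keyed on P alone, the order of the list affects only how long the search takes, as the text notes.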


Programs and Routines 


The rule of combination of IP's is simple: any 
one IP may follow another. We shall consider time 
to be discrete, using it essentially as an index, 
and shall assume that only one process occurs at
a time. We say that a particular IP has control
when it is occurring. Thus, when a sequence of
IP's occurs one after the other in consecutive
time intervals, there occurs a series of transfers
of control from each IP to the next in the
sequence.

The operation of any IP includes a processing 
component and a control component. The process- 
ing component changes the memory content of the 
IPS; the control component transfers control to 
another IP. In some IP's, processing is the
significant component. In these the transfer of 
control is independent of the memory contents at 
the time the IP occurs. In certain other IP's, 
control is the significant component. These do 
not alter memory contents, but transfer control 
to various IP's depending on the memory contents 
when they occur. In other IP's both processing 
and control components are significant.


Control. We allow only a binary branch in 
control at any one instruction. Normally, con- 
trol passes in a linear sequence through a set 
of IP's. We write this sequence vertically. Each
instruction is considered to have a location in 
the sequence. For branch instructions (those in 
which the control component transfers control to 
one of two IP's depending on memory content), 
control transfers either (1) to the next instruc- 
tion in the sequence or (2) to the instruction 
named in the branch location, B. These locations
are designated by letters A, B, C, .... 


In Figure 2, instruction #1 transfers control to #2;
#2 transfers control to #3 or branches to A
(which is #4) depending on memory content; #3
transfers control to #4; #4 transfers control to
#5 or branches back to B, which is #1.

Each control operation can be reversed in 
sense by putting a minus sign in front of the 
operation name. The effect of the minus sign is 
simply to reverse the condition of transfer. That 
is, if CC-A transfers to A when two specified 
numbers are equal, then -CC-A transfers to A when 
these numbers are unequal. 


LOCN   OPER   L C R   B

B      #1
       #2             A
       #3
A      #4             B
       etc.

Fig. 2


Routines. We will call such a list of in- 
structions with a control network a routine, 
again, in direct analogy to computer terminology. 
Notice that a routine satisfies our definition of 
a program (IPP): if all the memories referred to 
have specified initial contents, the routine de- 
termines their contents at all later times cov- 
ered by its duration. 
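The control rules above (linear sequence, a single binary branch per instruction, and reversal of a branch condition by a minus sign) can be sketched with a very small interpreter. The two operations used here, INC and EQ, are hypothetical stand-ins chosen for illustration; they are not elementary processes of LL.

```python
def run(routine, memory, max_steps=1000):
    """Execute a routine: a list of instructions with optional location labels."""
    labels = {ins["locn"]: i for i, ins in enumerate(routine) if "locn" in ins}
    pc = 0                                   # index of the IP that has control
    for _ in range(max_steps):
        if pc >= len(routine):
            return memory                    # control has left the routine
        ins = routine[pc]
        negate = ins["oper"].startswith("-")
        name = ins["oper"].lstrip("-")
        if name == "INC":                    # processing: add one to memory L
            memory[ins["L"]] += 1
            pc += 1
        elif name == "EQ":                   # control: branch to B if L equals C
            condition = memory[ins["L"]] == memory[ins["C"]]
            if negate:                       # a minus sign reverses the condition
                condition = not condition
            pc = labels[ins["B"]] if condition else pc + 1
        else:
            raise ValueError(f"unknown operation {ins['oper']!r}")
    raise RuntimeError("step limit reached")


# Count working memory 1 up until it equals working memory 2.
routine = [
    {"locn": "B", "oper": "INC", "L": 1},
    {"oper": "-EQ", "L": 1, "C": 2, "B": "B"},  # branch back while unequal
]
final_memory = run(routine, {1: 0, 2: 3})
```

Given the initial contents of memories 1 and 2, the routine determines their contents at every later step, which is the sense in which a routine satisfies the definition of a program.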

If we postulate a set of elementary informa- 
tion processes, each specified by an instruction, 
it might be supposed that each routine would de- 
fine a new (non-elementary) information process. 
This is not the case, for in LL the format of an 
instruction (Figure 1) allows reference to not 
more than three working memories and to not more 
than one branch. Hence, only those routines may 
be regarded as definitions of IP's which satisfy 
the following conditions: 


1. The routine contains branches to not more 
than two instructions outside the routine; 

2. Not more than three working memories that 
are to be referred to subsequently are changed by 
the routine. This means that even though other 
working memories are changed, there is no way to 
refer to these memories in subsequent routines. 


Within these restrictions we can define a 
set of new IP's in terms of the elementary IP's,
then another set of IP's in terms of both the 
elementary and defined IP's, and so on; thus 
creating a whole hierarchy of IP's and their 
corresponding routines. The elementary IP's and 
the hierarchy of defined IP's for LT are given in 
Section III, and its structure is explained in
some detail in Section II.

The restrictions imposed above on numbers of 
branches and working memories in IP's have the 
following two consequences for the structure of 
the routines that are used to define IP's: 


1. A working memory can be used only within 
the routine in which it is introduced. That is,
working memories introduced in a particular
routine cannot be referred to when control is in 
any other routine, except as noted in rule 2. 
For this reason, no ambiguity arises from using 
the same names, 1, 2, . . ., for different memo- 
ries in distinct routines. 

2. Within the routine that defines a par- 
ticular IP, reference may be made to the working 
memories that are designated in the reference 
places of that IP. Let I1 be an instruction that
appears in the routine defining I2. The symbols
L, C, R in I1 refer, respectively, to the working
memories in the left, center, and right reference
places of instruction I2, in whose definition I1
occurs. (See, for example, the first instruction,
FEF, in the routine given in full at the end of the
section.) Some such arrangement is obviously re- 
quired if the defining routine is to have any 
connection with the instruction it defines. 


Elementary Processes 


In LT there are forty-four different elementary
processes. These represent variations on eight
types of operations. The remainder of this section
will be devoted to a description of these
types, and an enumeration of the elementary processes
that belong to each type. Separate, explicit
definitions for each elementary IP are
given in Section III. The first letter in the
name of an operation designates the type to which
it belongs: A for assign, B for branch, C for
compare, F for find, N for numerical, P for put,
S for store, and T for test.

Find instructions obtain information from 
storage memory on the basis of stated relation- 
ships, and put it in specified working memories. 
An example, FR-x-y (Find the right subelement 
of E(x) and put it in y), has already been de- 
scribed. Two other Find instructions are very 
similar: FL (Find the left subelement) and FM 
(Find the main element).

Other Find instructions involve the ordering 
relation on the lists. An example is: 


OPER   L C R   B
FEF    x y     A


This reads: Find the first element in X(x)—the 
expression associated with E(x)--in the list 
A(x), and put this element in y. Then go to next 
instruction, but if no element is found, branch 
to instruction A. Here the order of elements is 
essential since there may be many elements in 
X(x). This kind of operation is used to start a 
search, and it is always combined with an instruc- 
tion, FEN, for continuing and terminating the 
search: 


OPER   L C R   B
FEN    x y     A


This reads: Find the element in X(x) that is 
next in order after E(y) and put it in y. When 
such an element is found, branch to A; if none 

is found, transfer control to the next instruc- 
tion in sequence. FEF and FEN together allow the 
familiar cycling or iteration that is a common 
feature of computing routines: 


(after all elements of X(x) have been processed) 
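
The FEF/FEN pairing described above behaves like a first/next iteration over a list. The following is a minimal Python sketch of that control pattern, not the LL instructions themselves: the helper names `fef`, `fen`, and `process_all` are hypothetical, and the list A(x) is assumed to be an ordinary Python list indexed by position.

```python
def fef(lst):
    """FEF analogue: return (index of first element, True), or (None, False) if empty."""
    return (0, True) if lst else (None, False)


def fen(lst, idx):
    """FEN analogue: return (index of next element, True), or (None, False) when exhausted."""
    nxt = idx + 1
    return (nxt, True) if nxt < len(lst) else (None, False)


def process_all(lst, process):
    """Start the search with FEF, run the body, and let FEN branch back while elements remain."""
    y, found = fef(lst)
    if not found:            # FEF branches away when there is no element
        return
    while True:
        process(lst[y])      # loop body
        y, found = fen(lst, y)
        if not found:        # FEN falls through to the next instruction in sequence
            break


seen = []
process_all(["p", "q", "r"], seen.append)
```

Here `seen` ends up holding every element of the list in order, and an empty list skips the body entirely, mirroring the branch taken by FEF when no element is found.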


The complete list of elementary Find instruc- 
tions is: 


FEF FL FM 
FEN FR 


Store instructions transfer information from 
working memory back to storage memory. An example 
is: 


OPER   L C R   B
S      x


This simply reads: Store E(x) in the storage
memory. If the element in x is one that was previously
withdrawn from storage, it will be replaced
in its original location within A(x); if
it is a new element in List A, it will be placed
at the end of the list.

Another elementary Store instruction is SEN, 
which puts E(x) into storage memory at the end 
of the list A(y). A third is *SX, which simply
stores a copy of X(x) in memory location A(y). 


The complete list of elementary store instructions 
is: 

S     *SX    *SXL   *SXM
SEN   *SXE   *SXR


Instructions belonging to the remaining six 
types are concerned only with working memory 
(See Figure 3). No complex processing may take 
place in storage memory, and conversely, as we 
have seen, no information may be stored in working
memory except on a temporary basis.

Put instructions transfer information and sym- 
bols around the working memory. A typical Put 
instruction is: 


OPER   L C R   B
PE     x y


This reads: Put E(x) in E(y). The operation 
leaves E(x) unchanged and duplicates it in E(y). 
The variations on this instruction correspond to 
the different symbols in an element that may need 
to be transferred. The list of Put instructions 
is:


PE FPCv PU 
PK PC? PUB 


⁸ Certain of the Store instructions are marked with
an asterisk. These are treated as elementary
operations in the present section and in Part I of
Section III, but in Part II of Section III they
are defined in terms of simpler operations.


Processing   N  A  P
                          Working memory
Control      C  T  B


Fig. 3


Numerical instructions carry out the arithmetic
operations. An example is:


OPER   L C R   B
NAG    x


This reads: Add 1 to G(x). Operations are re- 
quired to permit addition and subtraction for sym- 
bols G, H, J, K, and W. The list of Numerical 
instructions is: 


NAG NAH NAJ NSG
NAGG NAK NAW NSGG 


Assign instructions write in new names and 
locations in elements that are in working memory. 
One Assign instruction is: 


OPER   L C R   B
AN     x


This reads: Assign an unused name to E(x). The
other Assign instruction, AA, assigns new list
locations. There are, then, only two Assign instructions:


AA AN 


Compare instructions belong to a class of 
pure control instructions. They compare two symbols
for equality (or, if appropriate, for the
relation "greater"); then transfer to the branch 
location if the condition is satisfied or to the 
next instruction in sequence if the condition is 
not satisfied. The sense of the branch on these 
and all other branch instructions can be reversed 
by a minus sign preceding the operation. A
typical example is: 


OPER   L C R   B
CC     x y     A


This reads: If C(x) = C(y), branch control to 
location A; if not, go to the next instruction in 
sequence. That is, if the connective in x is
identical with the connective of y, we branch to
A. Notice that there is no change in memory content;
only a transfer of control has occurred.
The Compare instructions are:


CC CGG CWG
CN CKG CPS 


Test instructions are also control instruc- 
tions. They test the properties of a single 
element, and transfer control accordingly. The 
variations of the type deal with different prop- 
erties. An example is: 


OPER   L C R   B
TU     x       A


This reads: If E(x) is a unit, transfer control
to A; if not, go to the next instruction in sequence.
TC transfers control if C(x) is
"implies"; goes to the next instruction if C(x)
is "or". The Test instructions are:


TY TB TU TF
TGG TC+ TN


Branch instructions are unconditional con- 
trol instructions that cause the program to 
branch to the indicated address instead of going 
to the next instruction in sequence. The simplest 
example is: 


OPER   L C R   B
B              b


When this instruction is reached, the program 
simply branches to instruction b in the same 
routine. 

When the instructions BHB or BHN occur in a 
routine, they cause the program to branch to an 
address determined by the higher-level instruction 
that the routine defines. For example, suppose 
BHB appears as one of the instructions
within the routine: 


OPER   L C R   B
MSb    x       b


Then, the occurrence of BHB will cause control to
branch to the address b of MSb.

Suppose, further, that MSb appears as one of 
the instructions in the routine Ex, and that the 
instruction MDt appears immediately after MSb in 
Ex. Then, if BHN is one of the instructions in 
the routine MSb, its occurrence will cause control
to branch to the next instruction after MSb
in the higher routine, Ex, i.e., to MDt. Thus 
BHB and BHN are the instructions that terminate 
control by a particular routine, and cause control
to transfer, respectively, to the branch
designated in the higher-level instruction de- 
fined by the routine, or to the higher-level in- 
struction that follows the routine. Instruction 
BHB produces the former transfer, BHN, the latter. 
The three Branch operations are:

B BHB BHN

Example. It will clarify matters and provide
some introduction to the complete program given
in Section III if we set forth in detail one of 
the simpler defined routines, the routine NH. 
This routine consists of six instructions, all 
of them primitives included in the list we have 
already given: 


LOCN   OPER   L C R   B

NH     Count the number of variable places in X(x), and
       record the result in H(x).




(1) FEF finds the first element in X(x) and puts
it in working memory 1. If there is no element,
it branches to C. (2) -CPS (note the negative
sense) determines whether E(1) is a subelement
of E(x). If it is not, control transfers to B;
if it is, control transfers to the next instruction
in sequence. (Henceforth we will abbreviate
these transfers as →B and →next, respectively.)
(3) -TU determines whether E(1) is a
unit (i.e., is to be viewed as a variable). If
it is not (negative sense), →B; if it is, →next.
(4) NAH increases by 1 the number H(x). (Because
of the previous branches, NAH will occur only if
the element in 1 is viewed as a variable and is a
subelement of the element in x.) (5) FEN finds
the next element in X(x), puts it in working
memory 1, and returns control to instruction A,
whereupon the cycle is repeated from step (2).
If there are no more elements, →next. (6) BHN
terminates the routine after all elements in
X(x) have been examined, and transfers control to
the instruction that follows NH at the next higher
level of the hierarchy of routines.
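
The six steps of NH can be traced with a small interpreter. The sketch below is written in Python rather than LL and rests on several assumptions that are not in the text: the instruction table is reconstructed from the prose (A taken as step 2, B as step 5, C as step 6), each element of X(x) is modeled as a dictionary with hypothetical boolean fields `sub` (what CPS would report) and `unit` (what TU would report), and the function name `run_nh` is invented for illustration.

```python
def run_nh(elements):
    """Simulate NH: count elements that are subelements of E(x) and are units."""
    # Reconstructed listing (locations inferred from the step-by-step description):
    #        FEF   x 1   C
    #   A   -CPS   1 x   B
    #       -TU    1     B
    #        NAH   x
    #   B    FEN   x 1   A
    #   C    BHN
    labels = {"A": 1, "B": 4, "C": 5}
    program = [
        ("FEF", False, "C"),
        ("CPS", True, "B"),    # True marks the reversed (minus-sign) sense
        ("TU", True, "B"),
        ("NAH", False, None),
        ("FEN", False, "A"),
        ("BHN", False, None),
    ]
    h = 0        # plays the role of H(x)
    cur = None   # position of the element held in working memory 1
    pc = 0       # location of the current instruction
    while True:
        op, negated, target = program[pc]
        branch = False
        if op == "FEF":
            if elements:
                cur = 0
            else:
                branch = True                      # no element found: go to C
        elif op == "FEN":
            if cur + 1 < len(elements):
                cur += 1
                branch = True                      # next element found: go to A
        elif op == "CPS":
            branch = elements[cur]["sub"] != negated
        elif op == "TU":
            branch = elements[cur]["unit"] != negated
        elif op == "NAH":
            h += 1
        elif op == "BHN":
            return h                               # hand control back to the caller
        pc = labels[target] if branch else pc + 1


# Elements of 2.01, (p -> -p) -> -p, listed from the main connective downward.
expr_2_01 = [
    {"sub": False, "unit": False},  # main connective, E(x) itself
    {"sub": True, "unit": False},   # connective of the left sub-expression
    {"sub": True, "unit": True},    # p
    {"sub": True, "unit": True},    # -p
    {"sub": True, "unit": True},    # -p on the right
]
h_2_01 = run_nh(expr_2_01)
```

With this encoding `h_2_01` is 3, which agrees with the value H = 3 that the paper later reports for 2.01. The expression `condition != negated` is one compact way to model the minus sign: it flips the branch condition without changing the test itself.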




Conclusion 


We have now completed our description of the
language LL. We have outlined the coding system,
the memory structure, the structure of the information
processes, the routines, and the types
of elementary processes. Further detail can be
found by consulting Section III. In Section II
we shall construct in this language a program, 
LT, that will permit the information processing 
system to solve problems in symbolic logic. 


II


The Logic Theory Machine 


In the language we have considered we have
variables (atomic sentences): p, q, r, A, B, C,
..., and connectives: - (not), v (or), → (implies).
The connectives are used to combine the variables 
into expressions (molecular sentences). We have 
already considered one example of an expression: 


1.7   -p .→. q v -p

The task set for LT will be to prove that 
certain expressions are theorems--that is, that
they can be derived by application of specified 
rules of inference from a set of primitive sen- 
tences or axioms. 

The two connectives, - and v, are taken as 
primitives. The third connective, →, is defined
in terms of the other two, thus:

1.01   p → q  =def  -p v q

The five axioms that are postulated to be 
true are: 


1.2   p v p .→. p

1.3   p .→. q v p

1.4   p v q .→. q v p

1.5   p .v. q v r :→: q .v. p v r

1.6   p → q .→: r v p .→. r v q


Each of these axioms is stored as a list in 
the theorem memory, T, with all its variables 
marked free, F, in their respective elements. 


From the axioms other true expressions can 
be derived as theorems. In the system of 
Principia Mathematica, there are two rules of 
inference by means of which new theorems can be 
derived from true expressions (theorems and 
axioms). These are:

Rule of Substitution: If A(p) is any true
expression containing the variable p, and
B any expression, then A(B) is also a true
expression.

Rule of Detachment: If A is any true expression,
and the expression A → B is also
true, then B is a true expression.

To these two rules of inference is added the 
rule of replacement, which states that an expression
may be replaced by its definition. In
the present context, the only definition is 1.01,
hence the rule of replacement permits any occurrence
of (-p v q) in an expression to be replaced
with (p → q), and any occurrence of (p → q) to be
replaced with (-p v q).

In this system, then, a proof is a sequence 
of expressions, the first of which are accepted 
as axioms or as theorems, and each of the re- 
mainder of which is obtained from one or two of 
the preceding by the operations of substitution, 
detachment, or replacement. 


Example: prove 2.01, p → -p .→. -p:
(1) ! p v p .→. p         (axiom 1.2)¹⁰
(2) ! -p v -p .→. -p      (by subst. of -p for p)
(3) ! p → -p .→. -p       (by replacement on left)
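
The three-line proof above can be replayed mechanically. The following Python sketch is one possible encoding, not the representation used by LT: expressions are assumed to be nested tuples (`"p"` for a variable, `("not", e)`, `("or", a, b)`, `("imp", a, b)`), and the function names `substitute`, `replace_def`, and `detach` are hypothetical.

```python
def substitute(expr, var, repl):
    """Rule of substitution: replace every occurrence of var in expr by repl."""
    if isinstance(expr, str):
        return repl if expr == var else expr
    return (expr[0],) + tuple(substitute(e, var, repl) for e in expr[1:])


def replace_def(expr):
    """Rule of replacement: apply definition 1.01 at the top of expr, in either direction."""
    if isinstance(expr, str):
        return expr
    if expr[0] == "or" and not isinstance(expr[1], str) and expr[1][0] == "not":
        return ("imp", expr[1][1], expr[2])        # (-a v b) becomes (a -> b)
    if expr[0] == "imp":
        return ("or", ("not", expr[1]), expr[2])   # (a -> b) becomes (-a v b)
    return expr


def detach(a, a_implies_b):
    """Rule of detachment: from A and A -> B obtain B; None if the rule does not apply."""
    if not isinstance(a_implies_b, str) and a_implies_b[0] == "imp" and a_implies_b[1] == a:
        return a_implies_b[2]
    return None


axiom_1_2 = ("imp", ("or", "p", "p"), "p")                  # (1) p v p .->. p
step_2 = substitute(axiom_1_2, "p", ("not", "p"))           # (2) -p v -p .->. -p
step_3 = ("imp", replace_def(step_2[1]), step_2[2])         # (3) replacement on the left
theorem_2_01 = ("imp", ("imp", "p", ("not", "p")), ("not", "p"))
```

Under this encoding `step_3` is structurally equal to `theorem_2_01`, so the replayed chain ends in the theorem to be proved. `detach` is not needed for this particular proof and is included only to round out the three rules.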


The problem now is to specify a program for 
LT such that, when a problem is proposed in the 
form of a theorem to be proved (like 2.01 above), 
a proof will be discovered and constructed. 
First, it should be observed that there is a 
systematic algorithm for constructing such a proof 
should one exist. Starting with the five axioms, 
we construct all the theorems that can be ob- 
tained from them by a single application of the 
rules of substitution, detachment, or replacement.¹¹
We thus obtain the set of all theorems
that can be obtained from the axioms by proofs 
not more than one step in length. Repeating this 
process with the enlarged set of theorems, we ob- 
tain the set of all theorems that can be derived 


⁹ As we shall see, 1.01 is not held in storage
memory, but is represented, instead, by two
routines for actually performing the replacements.


¹⁰ The exclamation point in front of an expression
indicates that the expression in question is
asserted to be true. To designate an expression
whose truth has not been demonstrated, we will
use a question mark preceding the expression.


¹¹ A technical difficulty arises from the fact
that there is an infinite number of valid substitutions.
This difficulty can be removed
rather easily, but the question is irrelevant
for the purposes of this paper.




from the axioms by proofs not more than two steps 
in length. Continuing, we finally obtain the set 
of theorems that can be derived by proofs not more 
than n steps in length. 

Now, if the theorem in which we are inter- 
ested possesses a proof k steps in length, we 
can, in principle, discover it by constructing 
all valid proof chains of length not more than k, 
and selecting any one of these that terminates in 
the theorem in question. This "in principle" 
possibility is in fact computationally infeasi- 
ble because of the very large number of valid 
chains of length k that can be constructed, even 
when k is a number of moderate size. Under these 
circumstances, the rules of inference do not give 
us sufficient guidance to permit us to construct 
the proof we are seeking; and we need additional 
help from some system of heuristic.

The problem will be solved if we can devise 
a program for constructing chains of theorems, 
not at random, but in response to cues that make 
discovery of a proof probable within a reasonable 
computing time. For example, suppose the rules 
of inference were such as to permit any given 
proof chain to be continued, on the average, in 
ten different ways. Then there would be ten
thousand proof chains four steps in length (10^4).
The expected number of proof chains that would
have to be examined to find any particular proof
by random search is five thousand. Suppose, however,
that LT responded to cues that permitted
eight of the ten continuations at each step to be
eliminated from consideration. Then the number
of proof chains four steps in length that would
have to be examined in full would be only sixteen
(2^4), and the expected number would be only eight.
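
The arithmetic of this example can be checked in a few lines of Python. The function names are hypothetical, and the expected-search figure follows the paper's own approximation that about half of the candidate chains are examined on average.

```python
def chain_count(continuations, steps):
    """Number of proof chains of the given length when each step has this many continuations."""
    return continuations ** steps


def expected_examined(total):
    """The paper's approximation: on average about half of the chains are examined."""
    return total // 2


unguided = chain_count(10, 4)        # ten continuations per step, four steps
guided = chain_count(10 - 8, 4)      # cues eliminate eight of the ten continuations
saving_factor = unguided // guided   # how many times smaller the guided search is
```

With these figures the cues shrink the search by a factor of 625, from ten thousand chains to sixteen.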


The Program of LT 


We wish now to describe the program of LT, 
which is given in full in Section III; hence, in
the text we shall refer frequently to Section III 
for detail. We shall refer to each routine by 
its name (e.g., LMc for the matching routine),
but we shall need some additional notation to re- 
fer to the main segments of routines that do not 
themselves have names. The names of these segments
are given in Section III in the column
marked "Seg." In each segment there is generally
one main operation to be performed; and this main 
operation, or sub-routine, is usually surrounded 
by a number of procedural and control operations 
that fit it into the larger routine. In ordinary 
language, we would say that the "function" of the 
segment is to perform the main operation that is 
contained in it. For example, the main operation 
in the third segment of LMc is LSby, a substitution.
The function of this segment in the
matching program is to substitute one sub-expres- 
sion for another in one of the expressions being 
matched. Hence, we will name the segment after 
the main operation: LMc(Sby). Similar designations
will be used for the other segments of
routines. This notation emphasizes the fact that
each routine consists in a sequence (or branching 
tree) of main operations that are connected by 
procedural and test operations. Thus, an abbre- 
viated description of the matching routine might 
be given as: 


LMc

  T     Perform diagnostic tests

  LMc   Recursion of matching with next elements
        in logic expression

  Sby   Substitute the element y for the
        element x

  Sbx   Substitute the element x for the
        element y

  CN    Compare variables in x and y

  Rp    Replace connectives, if required and
        possible


The Substitution Method 


Let us take as our first example the very 
simple expression, 2.01, for which we have al- 
ready given a proof. We suppose that, when the 
problem is proposed, LT has in its theorem memory
only the axioms, 1.2 to 1.6. We wish to
construct a proof (the one given above, or any 
other valid proof) for 2.01. 

As the simplest possibility, let us con- 
sider proofs that involve only the rules of sub- 
stitution and replacement. We are now able to 
state the problem thus: how can we search for a 
proof of the expression by substitution without 
considering all the valid substitutions in the 
five axioms? We use two devices to focus the 
search. Both of these involve "working back- 
ward" from the expression we wish to prove--for 
by taking account of the characteristics of that 
expression, we can obtain cues as to the most 
promising lines to follow: 


1. In attempting substitutions, we will 
limit ourselves to axioms (or other true theo- 
rems, if any have already been proved) that are 
in some sense "similar" in structure to the theo- 
rem to be proved. The routine that accomplishes 
this will be called the test similarity routine, 
CSm.

2. In selecting the particular substitutions 
to be made in a theorem that has been chosen for 
trial, we will attempt to match the variables 
in that theorem to the variables in the expression
to be proved. Similarly, we will try to use
the rule of replacement to match connectives.

The routine in which these various operations 
occur is called the matching routine, LMc.


Using these devices, the proposed routine 
for proving theorems--the method of substitution,
MSb--works as follows. MSb(Sm): search for an
axiom or theorem that is similar to the expres- 
sion to be proved. MSb(Mc): when one is found, 
try to match it with the expression to be proved; 
if a match is successful, the expression is 
proved; if the list of axioms and theorems is ex- 
hausted without producing a match, the method 
has failed. (Reference to Section III will show
that there is another segment, MSb(NAW), that 
we have not mentioned. The function of this seg- 
ment will be discussed later in connection with 
the executive routine.) 


To see in detail how the method operates, we 
next examine the main operations, CSm and LMc, 
of the two segments of the substitution method. 
For concreteness, we will carry out these operations
explicitly for the proof of the expression
2.01.


2.01   ?   p → -p .→. -p

Test for Similarity, CSm. We must state what
we mean by similarity. We start from a commonsense
viewpoint and regard two propositions as
similar if they "look" similar to the eye of a
logician. In Section I we have already defined
three characteristics of an expression that can
be used as criteria of similarity. These are:
K, the number of levels in the expression; J,
the number of distinct variables in the expression;
and H, the number of variable places in the
expression.¹²


Applying these definitions to 2.01 (routines 
NK, NJ, and NH, respectively), we find that 
K = 3, J = 1, and H = 3. That is, 2.01 has three
levels, one distinct variable (p), and three 
variable places. We may write this: 


D(2.01) = (3,1,3) 


In the same way, we can write descriptions 
for the various sub-expressions contained in
2.01--in particular, the sub-expressions to the
left and to the right of the main connective, re- 
spectively. We have for these: 


DL(2.01) = (2,1,2); and DR(2.01) = (1,1,1)


Now, we say that two expressions, x and y, 
are similar if they have identical left and right 
descriptions; i.e., if DL(x) = DL(y) and DR(x) = 
DR(y). The routine for determining whether two 
theorems are similar, CSm, consists of two segments:
(1) CSm(D), a description segment, and
(2) CSm(CD), a comparison of descriptions. The 
description segment is made up of four descrip- 
tion routines, D, one each to compute DL(x), 
DR(x), DL(y), and DR(y). The comparison segment 
is made up of two compare description routines, 
CD, one of which compares DL(x) with DL(y), the 
other DR(x) with DR(y).
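
The description triple (K, J, H) and the similarity test can be sketched in Python. This is an illustrative encoding, not the LT routines D and CD: expressions are assumed to be nested tuples (`"p"`, `("not", e)`, `("or", a, b)`, `("imp", a, b)`), the function names are hypothetical, and the rule that a negation sign does not add a level is an assumption inferred from the values the paper quotes for 2.01.

```python
def levels(e):
    """K: number of levels; a variable is one level, a negation sign adds none (assumption)."""
    if isinstance(e, str):
        return 1
    if e[0] == "not":
        return levels(e[1])
    return 1 + max(levels(e[1]), levels(e[2]))


def variable_places(e):
    """List every variable occurrence in e, in left-to-right order."""
    if isinstance(e, str):
        return [e]
    return [v for sub in e[1:] for v in variable_places(sub)]


def describe(e):
    """Return the description (K, J, H) of an expression."""
    places = variable_places(e)
    return (levels(e), len(set(places)), len(places))


def similar(x, y):
    """Two expressions with binary main connectives are similar if DL and DR agree."""
    return describe(x[1]) == describe(y[1]) and describe(x[2]) == describe(y[2])


thm_2_01 = ("imp", ("imp", "p", ("not", "p")), ("not", "p"))
axioms = {
    "1.2": ("imp", ("or", "p", "p"), "p"),
    "1.3": ("imp", "p", ("or", "q", "p")),
    "1.4": ("imp", ("or", "p", "q"), ("or", "q", "p")),
    "1.5": ("imp", ("or", "p", ("or", "q", "r")), ("or", "q", ("or", "p", "r"))),
    "1.6": ("imp", ("imp", "p", "q"), ("imp", ("or", "r", "p"), ("or", "r", "q"))),
}
similar_axioms = [name for name, ax in axioms.items() if similar(ax, thm_2_01)]
```

Under these assumptions `describe(thm_2_01)` gives (3, 1, 3), the left and right descriptions are (2, 1, 2) and (1, 1, 1), and the only axiom selected as similar is 1.2.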

A diagram of the hierarchy of principal sub- 
routines in testing similarity will look like 
this: 


¹² The assertion is that two expressions having
the same description "look alike" in some un- 
defined sense; and hence if we are seeking to 
prove one of them as a theorem, while the other 
is an axiom or theorem already proved, then the 
latter is likely construction material for the 
proof of the former. Empirically, it turns out 
that with the particular definition of similar- 
ity introduced here, in proving the theorems of 
Chapter 2 of Principia Mathematica about one 
theorem in five that is stored in the theorem 
memory turns out to be similar to the expression 
we are seeking to prove. It is easy to suggest 
a number of alternative, and quite different 
criteria that would be equally symptomatic of 
"similarity." Uniqueness is of no account here; 
all we are concerned with is that we have some 
criteria that "work"--that select theorems
suitable for matching. 


In the case of 2.01, the segment MSb(Sm) will 
search the list of axioms and theorems and will 
find that axiom 1.2 is similar to 2.01: 

1.2   ! p v p .→. p

for it, too, has the descriptions: DL(1.2) =
(2,1,2); DR(1.2) = (1,1,1). Moreover, 1.2 is
the only axiom that has this description. 


Matching Expressions, LMc. Next we carry 
out a point-by-point comparison between 2.01, 
the expression to be proved, and 1.2, the axiom 
that is similar to it. We start with the main 
connectives, and work systematically down the 
tree of the logic expressions--always as far as
possible to the left. In the present case the
order in which we will match is: main connective
(P = none), connective of left sub-expression
(P=L), left variable of sub-expression (P=LL),
right variable of sub-expression (P=LR), and
right sub-expression (P=R).

The matching routine is fairly complicated, 
consisting of six segments, but not all segments 
are employed each time two elements are matched.
The first segment, LMc(T), and the initial operations
of most of the other segments consist of
tests that determine whether the two elements to
be matched are already identical, whether they
can be made identical by substitution (if one is
a free variable) or by replacement (if both are
connectives), or, finally, whether matching is
impossible. The second segment, LMc(LMc), is a
recursion of the matching routine with each of 
the next lower pair of elements in the tree of 
the expression. This recursion segment operates 
only if the elements to be matched in LMc are 
identical connectives (or have been made so). 

The third and fourth segments, LMc(Sby) and
LMc(Sbx), apply the rule of substitution when
the tests have shown this to be appropriate.
LMc(Sby), which is executed whenever E(x) is a
free variable,¹³ simply substitutes the expression
X(y) for E(x). LMc(Sbx), which is executed
whenever E(y) is a free variable, substitutes
the expression X(x) for E(y). In both cases, of
course, substitution must take place throughout
the whole expression in which the free variable
occurs. This is taken care of automatically by
the process LSb. Also, since LMc matches X(x)
to X(y), LMc(Sby) has priority over LMc(Sbx), as
a careful examination of the test network will
reveal.

The fifth segment, LMc(CN), reports the 
successful termination of the matching program 


¹³ Essentially, a variable is free when no substitution
has yet been made for it. After any
substitution it is bound and no longer available 
for subsequent substitutions. As previously 
noted, all variables in expressions stored in the 
theorem memory are free. 




if E(x) and E(y) are identical variables, its 
failure if they cannot be made identical by 
substitution. 

The sixth segment, LMc(Rp), operates when 
E(x) and E(y) have different connectives. The 
segment replaces the connective in x by the 
connective in y whenever this replacement is 
legitimate, and then returns control to the re- 
cursion segment. 

By virtue of the recursion segment, the match- 
ing routine will attempt to match each pair of 
elements; if successful, will proceed to the next 
pair; if unsuccessful, will report failure. 
Hence, the routine will continue until it makes 
the theorem that is being matched identical with 
the expression to be proved, or until the matching
fails.

The hierarchy of principal routines looks 
like this: 
[Diagram: hierarchy of LMc and its principal subroutines, including the replacement routine LRp.]


Returning to our specific example of the two 
similar expressions, 1.2 and 2.01, we carry out 
the matching routine as follows: 


2.01   ?   p → -p .→. -p
1.2    !   A v A .→. A


(We use A instead of p in 1.2 to indicate that
the variable is free (F).)
a. The main connectives agree: both are →.
b. Proceeding downward to the left, the
connective is → in 2.01, but v in 1.2.
To change the v to →, we must have (because
of the definition, 1.01) a - before
the left-hand A in 1.2. This we
can obtain by making the substitution of
-B for A in 1.2. Having carried out this
substitution, and having then replaced
(-B v -B) with (B → -B), we have the
following situation:


2.01   ?   p → -p .→. -p
1.2'   !   B → -B .→. -B

c. Proceeding again to the left, we find B
in 1.2', but p in 2.01. We therefore
substitute p for B in 1.2', and now find
(after recursion through the remaining
two elements) that we have a complete
match:

2.01   ?   p → -p .→. -p
1.2''  !   p → -p .→. -p


Thus, we have discovered a proof of 2.01 
(in fact, precisely the proof we gave before), 
which consists in substituting the variable -p 
for the variable in 1.2, and replacing the 
connective v in 1.2 with →.
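
The outcome of this matching can be reproduced with a much simpler procedure than LMc. The Python sketch below is a deliberate simplification, not the routine described in the paper: instead of replacing connectives on demand while walking the tree, it first expands every implication by definition 1.01 in both expressions and then performs a one-way match that binds the free variables of the theorem. The tuple encoding and the names `expand`, `match`, and `match_theorem` are assumptions made for illustration.

```python
def expand(e):
    """Rewrite every implication a -> b as (-a v b), using definition 1.01."""
    if isinstance(e, str):
        return e
    if e[0] == "not":
        return ("not", expand(e[1]))
    a, b = expand(e[1]), expand(e[2])
    return ("or", ("not", a), b) if e[0] == "imp" else ("or", a, b)


def match(pattern, target, free, bindings):
    """One-way match: bind free variables of pattern so that it equals target."""
    if isinstance(pattern, str) and pattern in free:
        if pattern in bindings:                 # already substituted: must agree
            return bindings[pattern] == target
        bindings[pattern] = target              # substitute target for the free variable
        return True
    if isinstance(pattern, str) or isinstance(target, str):
        return pattern == target
    if pattern[0] != target[0]:                 # different connectives after expansion
        return False
    return all(match(p, t, free, bindings) for p, t in zip(pattern[1:], target[1:]))


def match_theorem(theorem, goal, free):
    """Return the substitution that turns theorem into goal, or None if matching fails."""
    bindings = {}
    ok = match(expand(theorem), expand(goal), free, bindings)
    return bindings if ok else None


axiom_1_2 = ("imp", ("or", "A", "A"), "A")      # A marks the free variable, as in the text
axiom_1_3 = ("imp", "A", ("or", "B", "A"))
goal_2_01 = ("imp", ("imp", "p", ("not", "p")), ("not", "p"))
result = match_theorem(axiom_1_2, goal_2_01, {"A"})
```

Here `result` binds A to `("not", "p")`, which corresponds to the composite substitution found in steps b and c above (-B for A, then p for B). A limitation of expanding first is that the sketch does not record where replacements were applied, so it recovers the substitution but not the individual steps of the proof.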


This completes our outline of the method 
of substitution as a routine for discovering 
proofs in symbolic logic. The method may be 
viewed as an information process that is composed 
of a considerable number of more elementary in- 


formation processes arranged to operate in highly 
conditional sequences. Each of the main components--the
test for similarity routine, and the
matching routine--is made up, in turn, of subroutines.
The test conditions that control the
branchings of the sequences depend in a number of 
instances upon the outcomes of searches through 
the theorem memory. Hence, the method of sub- 
stitution represents a complex information proc- 
ess in the sense in which we have defined the 
term. Combining the two diagrams depicted above, 
we can illustrate the hierarchy of the main 
operations that enter into the substitution 
method: 


[Diagram: hierarchy of the main operations in the substitution method, including LSb.]


The method is a heuristic one, for it employs
cues, based on the characteristics of the
theorem to be proved, to limit the range of its
search; it does not systematically enumerate all
proofs. This use of cues represents a great
saving in search, but carries the penalty that a 
proof may not in fact be found. The test of a 
heuristic is empirical: does it work? 

Moreover, the cues that are used in the 
method are not without cost. For example, in 
order to limit matching attempts to "similar" 
theorems, theorems must be described and compared. 
The net saving in computing time, as compared with 
random search, is measured by the reduction in 
the number of theorems that have to be matched 
less the cost of carrying out the search and compare
for similarity routines. Stated otherwise,
cues are economical only if it is cheaper to ob- 
tain them than to obtain directly the information 
for which they serve as cues. 

To be sure, we have found a proof for one 
proposition in Principia; but how general is the 
substitution method? On examination of the 67 
propositions in Chapter 2 of Principia, it 
appears that some 21 can be proved by the method 
of substitution, including for example: 2,01, 
2e02s 25035) Se0ks 2e055 210, Codey Cotls 2o20, 
2.27. The remaining propositions evidently re- 
quire more powerful techniques of discovery and 
proof. It is evident, for instance, that we must 
employ the rule of detachment. 


The Method of Detachment 


We will describe next the method of detachment,
MDt, which, as its name implies, incorporates
the rule of detachment. The method, of
course, is not synonymous with the rule, but includes
also heuristic devices that select particular
theorems to which the rule is applied.
Let us review the principle of logic that
underlies the method. Suppose LT must prove that
expression A is a theorem; and assume that there
are in the theorem memory two theorems, B and
B → A. Then, by application of the rule of detachment
to B and B → A, A is derivable immediately.

We can generalize this procedure by combining
matching (substitution and replacement) with
detachment. Assume that the theorem memory contains
B" and B' → A'; that A is obtainable from A'
by matching; and that B' is obtainable from B" by
matching. Then we can construct a proof of A as
follows: (1) By matching with B", B' is a theorem.
(2) Since B' → A' is also a theorem, it follows by
detachment that A' is a theorem. (3) By matching
with A', A is a theorem.
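
The three-step construction above can be sketched as a search over the theorem memory. The Python code below is a simplified illustration under stated assumptions, not the MDt routine: it matches by substitution only (no replacement of connectives and no similarity test), it treats upper-case names as free variables, and the names `match`, `subst`, and `prove_by_detachment` are hypothetical. The goal used in the example is a made-up expression chosen so that one detachment suffices.

```python
def is_free(e):
    """Convention assumed here: upper-case names are free variables of stored theorems."""
    return isinstance(e, str) and e.isupper()


def match(pattern, target, bindings):
    """One-way match by substitution only: bind free variables of pattern to parts of target."""
    if is_free(pattern):
        if pattern in bindings:
            return bindings[pattern] == target
        bindings[pattern] = target
        return True
    if isinstance(pattern, str) or isinstance(target, str):
        return pattern == target
    if pattern[0] != target[0]:
        return False
    return all(match(p, t, bindings) for p, t in zip(pattern[1:], target[1:]))


def subst(e, bindings):
    """Carry a substitution through a whole expression."""
    if isinstance(e, str):
        return bindings.get(e, e)
    return (e[0],) + tuple(subst(s, bindings) for s in e[1:])


def prove_by_detachment(goal, theorems):
    """Find T = B' -> A' whose right side matches goal, then a theorem matching its left side."""
    for t in theorems:
        if isinstance(t, str) or t[0] != "imp":
            continue                               # T must have -> as main connective
        bindings = {}
        if not match(t[2], goal, bindings):        # step (3): A from A' by matching
            continue
        left = subst(t[1], bindings)               # the new problem: prove B'
        for t2 in theorems:
            if match(t2, left, {}):                # step (1): B' from B" by matching
                return t, t2                       # step (2): detachment then yields A'
    return None


AX_1_2 = ("imp", ("or", "A", "A"), "A")
AX_1_3 = ("imp", "A", ("or", "B", "A"))
goal = ("or", "r", ("imp", ("or", "p", "p"), "p"))     # r v (p v p .->. p)
proof = prove_by_detachment(goal, [AX_1_2, AX_1_3])
```

For this goal the search first tries 1.2 as T, fails to prove the resulting left side, and then succeeds with 1.3 as T and 1.2 as the theorem matching its left side, so `proof` is the pair `(AX_1_3, AX_1_2)`. The wasted first attempt illustrates why the paper narrows the search with similarity tests before matching.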

This settles the problem of constructing a 
valid proof by the method of detachment. From the 
standpoint of the discovery of a proof employing 
this method, the trick lies again in narrowing 
down the search for B' → A' and B", so that these
do not have to be sought through a very large-scale
trial-and-error search and substitution
program.


The basic structure of the detachment method is quite
similar to that of the substitution method, for
both methods utilize the same basic operations.
The first two segments of the detachment method,
MDt(SmV) and MDt(SmCt), carry out searches for 
similar expressions, in a way that will be in- 
dicated more precisely below. The next segment, 
MDt(Mc), carries out a matching of any expression 
so found with the theorem to be proved. If the 
matching is successful, a new problem is created 
by the segment MDt(F). This problem is then 
attacked, in the final segment, MDt(MSb), by the 
method of substitution. 

Again, designate by A the expression to be
proved. In MDt(SmV) we search the theorem memory
for theorems whose right sides are similar (by
the test, CSm, described previously) to the whole
expression A. If we find such a theorem (call it
T), we go to segment MDt(Mc), and apply the
matching operation to the right side of T and to
A. If we are successful in the matching, we find
the left side of T, MDt(P); and seek to prove by
the method of substitution that it is a theorem,
MDt(MSb). For if the left side of T is a theorem
and T is a theorem, then by detachment, the right
side of T is a theorem. But A can be obtained
from the right side of T by substitution, hence
is a theorem. (Note that a check is made to see
that T has → for a connective.)


Contraction. If the detachment method fails
to find a proof in the manner just described,
a new attempt is made by means of the second
segment, MDt(SmCt), employing a different
criterion of similarity from the one we have used
thus far. If the theorem is similar, the method
proceeds with the matching segment exactly as
before.

To see what is involved in this generalized
notion of similarity, let us consider two
expressions, A and A', with different
descriptions. If A has more levels and variable
places than A', it is still possible that A is
derivable from A' by substitution--specifically,
by substituting appropriate molecular expressions
for the variables of A'. For example, take as A
the expression:


2.06   ? p→q .→: q→r .→. p→r

for which we have DL(2.06) = (2,2,2), DR(2.06) =
(3,3,4); and take as A' the expression:

A'     ? a .→. b→c

for which we have DL(A') = (1,1,1), DR(A') =
(2,2,2).

If in A' we substitute p→q for a, q→r for b,
and p→r for c, we obtain 2.06. Operating in the
reverse direction, if we contract 2.06 by making
the inverse substitutions, we obtain A'. We can
therefore refer to A' as "2.06 viewed as
contracted."
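The descriptions quoted above can be reproduced with a short sketch. It assumes that a description is the triple (levels K, distinct units J, unit places H), an ordering inferred from the figures in the example rather than stated in this passage, and that the contracted view treats any binary expression whose two parts are both variables as a single unit. The tuple representation and all function names are hypothetical, and negation is ignored.

```python
# Hypothetical representation: a variable is a string; a binary
# expression is a tuple (connective, left, right).
IMPLIES = "->"


def is_unit(expr, contracted):
    """A variable is always a unit. In the contracted view, a binary
    expression whose two parts are both variables is also a unit."""
    if isinstance(expr, str):
        return True
    _, left, right = expr
    return contracted and isinstance(left, str) and isinstance(right, str)


def units(expr, contracted):
    """List the units of expr from left to right."""
    if is_unit(expr, contracted):
        return [expr]
    _, left, right = expr
    return units(left, contracted) + units(right, contracted)


def levels(expr, contracted):
    """Depth of the expression tree, counting a unit as one level."""
    if is_unit(expr, contracted):
        return 1
    _, left, right = expr
    return 1 + max(levels(left, contracted), levels(right, contracted))


def describe(expr, contracted=False):
    """Return (K, J, H): levels, distinct units, unit places."""
    found = units(expr, contracted)
    return (levels(expr, contracted), len(set(found)), len(found))


# Left and right sides of 2.06: p->q .->: q->r .->. p->r
left_206 = (IMPLIES, "p", "q")
right_206 = (IMPLIES, (IMPLIES, "q", "r"), (IMPLIES, "p", "r"))

print(describe(left_206), describe(right_206))
print(describe(left_206, contracted=True),
      describe(right_206, contracted=True))
```

Under these assumptions the contracted descriptions of 2.06 coincide with the ordinary descriptions of B and A→C, the two sub-expressions of the right side of 2.04.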

Since the purpose in searching for similar 
theorems is to find appropriate materials to 
which to apply the matching routine, there is no 
reason why we should not use this more general 
notion of similarity if it proves effective in 
finding materials that are useful. 

In general, what parts of an expression
should be considered as units in the search for
proofs is not a "given" for the problem solver.
LT makes an explicit decision each time it looks
for similar expressions as to what subexpressions
will be taken as units. In contracting 2.06, a
decision has been made that the elements p, q, and
r are too small, and that more aggregative
elements, e.g., (p→q) = a, should be perceived as
units.

Examination of the routines for describing
expressions (NH, NK, NJ) will reveal that these
routines in fact count units rather than
variables. Normally, the variables are the units
used in description, for VV precedes CSm in every
program except MDt. In the latter program,
however, it is sometimes useful to view
expressions as contracted, by means of VCt.


Example of Proof by Detachment. To illustrate
the method of detachment, let us carry out
explicitly the proof of 2.06:

2.06   ? p→q .→: q→r .→. p→r

The reader may verify that this theorem
cannot be proved by substitution in the axioms
and earlier theorems. Moreover the detachment
method without contraction will also fail, for
there is no theorem whose right side is similar
to 2.06. However, we have already seen that when
we contract 2.06, we obtain:

A'     ? a .→. b→c

where p→q has been contracted to a, q→r to b, and
p→r to c. We now have DL(A') = (1,1,1) and
DR(A') = (2,2,2), descriptions that are identical
with the descriptions of the sub-expressions of
the right side of 2.04:


2.04   ⊢ A .→. B→C :→: B .→. A→C
A'     ?               a .→. b→c


Having selected 2.04 by use of the routine
MDt(SmCt), we now proceed to match its right side
with 2.06 in segment MDt(Mc):


2.04   ⊢ A .→. B→C :→: B .→. A→C
2.06   ?               p→q .→: q→r .→. p→r
2.04'  ⊢ q→r .→: p→q .→. p→r :.→:. p→q .→: q→r .→. p→r


We have now created a new problem to replace
the original one: to prove that the left side of
2.04', q→r .→: p→q .→. p→r, is a theorem. We
apply the method of substitution, MDt(MSb). The
search of the theorem memory discloses 2.05 to be
similar to the left side of 2.04', and we proceed
to match them:


2.04'L ? q→r .→: p→q .→. p→r
2.05   ⊢ A→B .→: C→A .→. C→B


It is easy to see that with the substitution
of q for A, r for B, and p for C, the matching
will be successful. Hence we have B (2.05 with
the indicated substitution), and B→A (2.04'),
from which A (2.06) follows by the rule of
detachment.

The diagram below summarizes the principal
routines incorporated in the method of
detachment. A comparison of this diagram with the
one for the substitution method shows clearly that
both methods rest on the same component processes,
with minor modifications and new combinations
and conditions. The sole new process involved in
detachment is the viewing of theorems as
contracted.


[Diagram: routines of the method of detachment; legible labels: CSm, LMc, MSb]


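The proof of 2.06 worked through in this section can be traced with a minimal sketch. It assumes a one-way matcher in which only the free variables of a stored theorem (written here in upper case) may be replaced; LT's matching routine also uses replacement of connectives, which is omitted. The similarity search is skipped and 2.04 and 2.05 are supplied directly. All names are hypothetical.

```python
IMPLIES = "->"


def is_var(e):
    return isinstance(e, str)


def substitute(e, s):
    """Apply substitution s (a dict) to every variable of e."""
    if is_var(e):
        return s.get(e, e)
    op, left, right = e
    return (op, substitute(left, s), substitute(right, s))


def match(pattern, target, s=None):
    """Find a substitution for the upper-case (free) variables of
    pattern that makes it identical to target; None if impossible."""
    s = dict(s or {})
    if is_var(pattern):
        if pattern.isupper():
            if pattern in s:
                return s if s[pattern] == target else None
            s[pattern] = target
            return s
        return s if pattern == target else None
    if is_var(target) or pattern[0] != target[0]:
        return None
    s = match(pattern[1], target[1], s)
    if s is None:
        return None
    return match(pattern[2], target[2], s)


def detachment_subproblem(theorem, goal):
    """If the right side of theorem matches goal, return the left
    side under the same substitution (the new problem); else None."""
    if is_var(theorem) or theorem[0] != IMPLIES:
        return None
    _, left, right = theorem
    s = match(right, goal)
    return None if s is None else substitute(left, s)


# 2.04: A .->. B->C :->: B .->. A->C
T204 = (IMPLIES, (IMPLIES, "A", (IMPLIES, "B", "C")),
        (IMPLIES, "B", (IMPLIES, "A", "C")))
# 2.05: A->B .->: C->A .->. C->B
T205 = (IMPLIES, (IMPLIES, "A", "B"),
        (IMPLIES, (IMPLIES, "C", "A"), (IMPLIES, "C", "B")))
# 2.06: p->q .->: q->r .->. p->r
GOAL_206 = (IMPLIES, (IMPLIES, "p", "q"),
            (IMPLIES, (IMPLIES, "q", "r"), (IMPLIES, "p", "r")))

subproblem = detachment_subproblem(T204, GOAL_206)  # left side of 2.04'
binding = match(T205, subproblem)                   # substitution into 2.05
print(subproblem)
print(binding)
```

The binding found for 2.05 is the one named in the text: q for A, r for B, and p for C.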
The Chaining Method


A number of expressions that do not yield
to the method of substitution can be proved by
the method of detachment. We shall add an
additional method, however, to the repertoire
available to LT. We shall call this method
chaining, MCh. Like the methods previously
described, chaining involves heuristic procedures
which we shall consider first.

Theorem 2.06, which we have just proved,
embodies one form of the principle of the
syllogism (2.05 is another form of this
principle). Now suppose T1, (p→q), is a true
theorem, and T2, (q→r), is another true theorem.
Theorem 2.06 is of the form:


T1 .→. T2→E


where E is (p→r), an expression not known to be
true. By detachment, from ⊢T1 and ⊢T1 .→. T2→E,
we get ⊢T2→E. By a second detachment, from
⊢T2 and ⊢T2→E, we get ⊢E. Hence, if we know
p→q and q→r to be true, we can construct a proof
of p→r by means of two detachments with the use
of 2.06. Instead of carrying through this
derivation explicitly in each instance, we simply
construct a program that makes direct use of the
transitivity of syllogism. This proof method is
the basis for chaining.


Suppose that we wish to prove A→C. We search
for a theorem, T (with → for a connective) whose
left side is similar to A, using the segment
MCh(SmF). We match the left side of T with A,
MCh(McF), and if we are successful, we have then
proved a theorem of the form A→B, for T, as
modified by matching, is of this form. We check
first, in segment MCh(McR), whether we can simply
match B to C. If we succeed, we have proved the
theorem. If we fail, we now construct, by segment
MCh(S), the expression B→C, and attempt to prove
this expression by substitution, MCh(MSb). If
we are successful, we now have a chain: A→B, B→C.
Then by syllogism, as indicated above, we obtain
A→C, the expression we wished to prove.

The procedure just described is chaining
forward. Alternatively, we may chain backward.
That is, to prove A→C, we may search for a
theorem of the form B→C; then try to prove A→B by
substitution.

Proof by the chaining method is illustrated by:

2.08   ? p→p

A search for theorems that have left sides
similar to the left side of 2.08 yields 1.3,
2.02, and 2.07. The latter is:


2.07   ⊢ p .→. pvp


If we take 2.07 as the (A→B) of the schema
given above, then B is (pvp). Two theorems have
left sides similar to B: 1.2 and 2.01. An attempt
to match the left side of 2.01 to the right side
of 2.07 will be unsuccessful, but the matching is
immediate with 1.2:


2.07   ⊢ p .→. pvp
1.2    ⊢       pvp .→. p


Hence we can take 1.2 as the (B→C) of the chaining
method. We now form (A→C) by joining the left
side of 2.07 to the right side of 1.2 by →. The
result is 2.08:

2.08   ⊢ p .→. p
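Forward chaining as just described can be sketched as follows. The sketch assumes the same hypothetical tuple representation used for illustration elsewhere, a one-way matcher that substitutes only for the upper-case (free) variables of stored theorems, and no similarity pre-filter: every stored theorem is simply tried in turn. The connective "v" and the theorem table are written out by hand for the 2.08 example.

```python
IMPLIES = "->"


def is_var(e):
    return isinstance(e, str)


def substitute(e, s):
    if is_var(e):
        return s.get(e, e)
    op, left, right = e
    return (op, substitute(left, s), substitute(right, s))


def match(pattern, target, s=None):
    """One-way match: bind upper-case variables of pattern to parts
    of target; return the substitution, or None on failure."""
    s = dict(s or {})
    if is_var(pattern):
        if pattern.isupper():
            if pattern in s:
                return s if s[pattern] == target else None
            s[pattern] = target
            return s
        return s if pattern == target else None
    if is_var(target) or pattern[0] != target[0]:
        return None
    s = match(pattern[1], target[1], s)
    if s is None:
        return None
    return match(pattern[2], target[2], s)


def chain_forward(goal, theorems):
    """To prove A->C: find T1 whose left side matches A, giving A->B;
    then either B matches C, or some T2 yields B->C by substitution.
    Returns the names of the theorems used, or None."""
    if is_var(goal) or goal[0] != IMPLIES:
        return None
    _, a, c = goal
    for name1, t1 in theorems.items():
        if is_var(t1) or t1[0] != IMPLIES:
            continue
        s = match(t1[1], a)
        if s is None:
            continue
        b = substitute(t1[2], s)
        if match(b, c) is not None:
            return (name1,)
        link = (IMPLIES, b, c)            # the subsidiary problem B->C
        for name2, t2 in theorems.items():
            if match(t2, link) is not None:
                return (name1, name2)
    return None


THEOREMS = {
    "1.2": (IMPLIES, ("v", "P", "P"), "P"),    # pvp .->. p
    "1.3": (IMPLIES, "Q", ("v", "P", "Q")),    # q .->. pvq
    "2.07": (IMPLIES, "P", ("v", "P", "P")),   # p .->. pvp
}
GOAL_208 = (IMPLIES, "p", "p")

print(chain_forward(GOAL_208, THEOREMS))
```

In this toy run 1.3 is tried and abandoned before 2.07 and 1.2 are found to form the chain, which mirrors the trial-and-error character of the search.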

The chaining method is summarized by the
following diagram:


[Diagram: routines of the chaining method; legible labels: CSm, LMc, MSb]


The Executive Routine 


It remains to complete the specification of
LT in two directions: first, to assemble the
three methods that have been described into a
coherent program; and second, to show how the
information processes in terms of which LT has
been described here can be specified precisely in
terms of the elementary processes listed in
Section I. The latter task is carried out in
detail in Section III. We will turn our attention
here to the former, which is embodied in the
executive routine, Ex.

In its first segment, Ex(R), the executive
routine reads a new expression that is presented
to it for proof, and places it in a working
memory.14 In the next three segments, Ex(MSb),
Ex(MDt), and Ex(MCh), successive attempts are made
to prove the expression by the methods of
substitution, detachment, and chaining,
respectively. If a proof is obtained by one of
these methods, the executive routine writes the
proof, Ex(WP); and stores the newly-proved theorem
(changing all its variables to free variables) in
the theorem memory, Ex(ST).

To explain what happens if the three methods
are unsuccessful, we have to take up some details
that were omitted above. These have to do with
the creation of subsidiary problems and with stop
rules.


Subsidiary problems. Both detachment and
chaining are two-step methods. Suppose we wish
to prove A. In detachment, we try to find a
theorem, B→A, and if we are successful, we then
try to prove B. The task of proving B we may call
a subsidiary problem.

Suppose we wish to prove a→b. In chaining,
we try to find a theorem, a→c, and if we are
successful, we then try to prove c→b. The task
of proving c→b is also a subsidiary problem.

Within both the detachment and chaining
methods, only the method of substitution is
applied to the subsidiary problem. If that method
fails, failure is reported for the main problem.
But before control is shifted back to the
executive routine, the main element of the
subsidiary problem is stored in the problem list,
P, in the storage memory. (The operation that
stores the problem in the problem list is the
operation SEN that can be found in segment MDt(P)
and segment MCh(P).)

When the three methods have failed for a given
problem, the executive routine stores it in the
inactive problem list, Q. It then selects from
the problem list, P, an expression that is, in a
certain sense, the simplest--specifically, an
expression with the smallest possible number of
levels, K, Ex(CK). It erases this new subsidiary
problem from P; checks to make certain it does not
duplicate one previously attempted, Ex(CX); and
then tries to solve this subsidiary problem by the
methods of detachment and chaining.15 This
sequence is repeated until some subsidiary problem
is solved (in which case the main problem is also
solved), or until no problems remain on the
problem list, or until the other stop rule, to be
described, comes into operation. In the latter
two cases, the routine reports that it is unable
to prove the theorem, Ex(WNP).


14 Certain segments of Ex, in particular Ex(R),
Ex(WP), Ex(ST) and Ex(WNP), are not written in
Section III in terms of the primitives but are
simply indicated by parentheses. It would be
rather simple to formalize them, but this would
further lengthen the description of the program.


15 There is no need to attempt to prove the
subsidiary problem by substitution, since an
unsuccessful substitution attempt was made
immediately before the expression was stored in
the subsidiary problem list.


The check to prevent duplication of subsidiary
problems, Ex(CX), is handled as follows: for each
problem that is selected from list P by Ex(CK), a
check is made, by Ex(CX), against all expressions
in the inactive problem list, Q, and if the new
problem duplicates any expression found there, it
is dropped. The main operation of this segment,
CX, applies the same basic tests of identity of
elements that are applied in the matching
program, but does not modify the expressions to
make them match.
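A test of identity of this kind can be sketched as a recursive comparison that never alters its arguments. The representation is an assumption made for illustration: every element carries a count G, taken here to be its number of negation signs; a variable is (G, name) and a binary expression is (G, connective, left, right). The function names are hypothetical.

```python
def identical(x, y):
    """True when x and y agree element for element: same kind of
    element, same G, same connective, and same variable names."""
    if len(x) != len(y) or x[0] != y[0]:
        return False                 # different kind, or different G
    if len(x) == 2:                  # both are variables
        return x[1] == y[1]
    _, conn_x, left_x, right_x = x
    _, conn_y, left_y, right_y = y
    return (conn_x == conn_y
            and identical(left_x, left_y)
            and identical(right_x, right_y))


def is_duplicate(problem, inactive_list):
    """Compare a newly selected problem with every expression in the
    inactive problem list Q."""
    return any(identical(problem, old) for old in inactive_list)


# -p -> q, the same expression again, and p -> q
a = (0, "->", (1, "p"), (0, "q"))
b = (0, "->", (1, "p"), (0, "q"))
c = (0, "->", (0, "p"), (0, "q"))
print(is_duplicate(a, [c, b]), is_duplicate(c, [a]))
```

Unlike the matching sketch style of test, nothing is substituted: two expressions that could be made to agree by substitution still count as different problems.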


Stop Rules. Since all proof methods may fail,
even if the expression given to LT is a genuine
theorem, the executive routine needs a stop rule.
One stop rule is provided by the exhaustion of
list P, but there is no guarantee that the list
will ever be exhausted. A second stop rule is
provided by an operation that measures the total
amount of "work" that has been done in attempting
to prove a theorem, and that terminates the
program with a "no proof" report when the total
work exceeds a specified amount. The first
operation in the substitution routine, NAW,
tallies one for each time the routine is used.
This tally is kept in a special location, W, in
the storage memory. The executive routine, just
before it seeks a new subsidiary problem, checks
the cumulative tally in this register, Ex(CW), and
if the tally exceeds a given limit, terminates the
program. Since the substitution routine is used
in each of the methods, the number of
substitutions attempted seems to be one reasonable
index of the amount of work that has been done.

This stop rule operates as a global constraint 
on the total work applied in trying to prove a 
single theorem. The rule does not govern the 
direction in which this effort is expended. The 
latter is determined by the priority rule pre- 
viously described for selecting subsidiary prob- 
lems from the problem memory and by the other 
elements of LT's program. 
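The control flow of the executive routine, with both stop rules, can be summarized in a short sketch. The proof methods are passed in as ordinary functions, so nothing here depends on how substitution, detachment, or chaining are implemented. The function signatures, the toy reduction table, and the use of string length as a stand-in for the number of levels K are all assumptions made for illustration.

```python
def executive(goal, substitution, two_step_methods, levels, work_limit):
    """Return "proof" or "no proof".

    substitution(x) -> bool is the method of substitution.
    Each two-step method has the form m(x, msb, P) -> bool: it applies
    msb to its subsidiary problem and appends that problem to P if
    substitution fails.
    """
    work = 0                      # register W
    P, Q = [], []                 # problem list and inactive list

    def msb(x):
        nonlocal work
        work += 1                 # NAW: one unit of work per use
        return substitution(x)

    def attempt(x, methods):
        return any(m(x, msb, P) for m in methods)

    all_methods = [lambda x, msb, P: msb(x)] + list(two_step_methods)
    if attempt(goal, all_methods):
        return "proof"
    Q.append(goal)
    while P:                      # first stop rule: P exhausted
        if work > work_limit:     # second stop rule: work limit
            return "no proof"
        x = min(P, key=levels)    # simplest problem first
        P.remove(x)
        if x in Q:                # duplicate of an earlier problem
            continue
        if attempt(x, two_step_methods):
            return "proof"
        Q.append(x)
    return "no proof"


# Toy problem space: "g" reduces to "aaa" or "bb"; "bb" reduces to "c".
REDUCTIONS = {"g": ["aaa", "bb"], "bb": ["c"]}


def reduce_method(x, msb, P):
    for sub in REDUCTIONS.get(x, []):
        if msb(sub):
            return True
        P.append(sub)
    return False


def solvable(x):
    return x == "c"


print(executive("g", solvable, [reduce_method], len, work_limit=10))
print(executive("g", solvable, [reduce_method], len, work_limit=2))
```

With a generous limit the routine reaches "c" through the shorter subsidiary problem "bb"; with a limit of 2 the work tally already exceeds the limit when the first subsidiary problem is about to be selected.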


Learning Processes 


The program we have described is primarily a
performance program rather than a learning
program. But, although the program of LT does not
change as it accumulates experience in solving
problems, learning does take place in one very
important respect. The program stores the new
theorems it proves, and these theorems are then
available as building blocks for the proofs of
subsequent theorems. Thus, in the theorems used
as examples in this paper, 2.06 was proved with
the aid of 2.05 and 2.04, and 2.08 was proved
with the aid of 2.07. Without this form of
learning it is doubtful whether the program would
prove any but the first few theorems in Chapter
2 in a reasonable number of steps.


III


The Complete Program 
for the Logic Theorist 


This Section is divided into two parts. The
first part constitutes the program as described
in the text, including the following routines:
Ex; MCh, MDt, MSb; LMc, LSb, LRp→v, LRpv→, VV,
VCt; CX; CSm, CD, D, NK, NH, NJ. These routines
are preceded by a list of the most important
primitive IP's--those that are used in several
routines. Following each routine is a
supplementary list of primitive IP's used in the
definition of that routine.

The second part of this Section consists of
routines for five IP's--those Store instructions
that are marked with asterisks (*)--which up to
this point have been treated as primitives.


Principal Primitive Instructions


A  OPER  L C R B

   B     b         Branch to b (→b).
   BHB             In higher instruction, →b.
   BHN             In higher instruction, →next.
   FEF   x y →b    Find the first E in A(x) and put in y;
                   if none, →b.
   FEN   x y →b    Find the E in A(x) next after E(y), put
                   in y; then →b. If none (end of list),
                   →next.
   FL    x y       Find EL(x) and put in y; if none, leave
                   y blank.
   FR    x y       Find ER(x) and put in y; if none, leave
                   y blank.
   PE    x y       Put E(x) in E(y); E(x) remains.
   S     x         Store E(x) back in A(x) (match on P);
                   if not there, store E(x) at end of A(x).
   SEN   x y       Store E(x) as next E in A(y); E(x) now
                   last item in A(y).
  *SX    x y       Store a copy of X(x) at (new) A(y)
                   (E(x) = M).
   TI    x →b      If C(x) = → (implies), →b.
   TV    x →b      If E(x) = V, →b.


A_ OPER LCRB Seg. 
eee Executive routine 


(Read problem X) R 
(Put EM(X) in 1) 
-MSb 
~MDt 
-MCh 
SEN 
CWG 
_ FEF 
NK 
-FEN 
NK 
CKG 
PE 
PK 
B 
E 


MSb 
MDt 


X(1) is finished, 


CK Find problem with 


lowest K,. 


YNNVDDVR ye HR Re 
WWD FPHe BD HY © 
Q Qe 8 Be OQQN 


CX Remove duplicates 
ee of previous problems. 
FEN 


B 


oOra-r 


Q 7 Se 
> ww y 


(Write proof.) WP 
(X(1) a theorem) ST 
(Stop) 


Succeeds in proving P, 


H (Writesno proof) WNP Fails to find proof. E TF L B CN 
(Stop) ery. H 
-CN Lc oD 
Ie ves BHN 
CK xy b_ If K(x)>K(y), -d. F -LRpw L G Rp LRp's are self-testing. 
CWG b If W (work done) > limit, +b, LRpv+ L H 
E xy Erase E(x) in A(y). G IM LC HH 
BHN 
Note: There are six IP's in the executive routine H BHB 
that are not formally defined in LT. These are 
written in parentheses above: read problem, find ives 
problem and put in working memory 1, write proof, 
store expression as theorem, write "no proof", cc xy »b_ If C(x) = C(y), 
and stop. CGG xy »b If G(x) > a} be 
CN xy b If N(x) = Ny), +d. 
Laka ot ee FM oxy Find EM(x) and put in y. 
A_OPER LCRB_ Seg. NSGG xy Subtract G(x) from G(y). 
Substitution method .F x b If E(x) is free, +d. 
If can't prove X(x) by 
MSb_ x b substitution, ~b. A_OPER LCRB Seg, 
NAW NAW Count one unit:of work. Substitution Fatiene 
Ww L Sm Substitute X(x) for 
FEF T1 C —1Sb__ xyz E(y) (=¥) in X(z) (=M). 
" Ae 1 OD FEF Lil F F 
B FEN Tl A Find next T and repeat. A ue i pF E(1) must belong to X(x), 
a B FEN Ll A i 
C FEF R2 F Sb_ Search thro X(z), 
oc Dy 2¢ & 7 
BHN ene ee | 
NAGG 2 3 G's add in Sb, 
. SXE 32 
Primitives E FEN R2 OD Find next E(z), repeat, 
NAW Add one to W (work done). ten 
A OPER LCRB Seg, G Lin een LSb 
Matching routine 
Match X(x). to X(y)3 if B Cc 
IMc xy pb can't, b. 
Primitives 
CcGaaG CL A T He hee 2 Hr) 
CcGG" 4 CC. Cc Now G(x) = G(y). x 88 an unused name to , 
TV ro E eS @) CN x b If N(x) = N(y) ~b. 
wv C D CPS xy »b If E(x) subelement af Ey)~b 
<~c Lc F (P(x) > P(y)). 
FL Ll Me NAGG- x y Add G(x) to G(y); result in 
FL «6G 2 G(y), 
IMc¢ 12 H Mc left subexpression. *SXE xy Store X(x) in A(y) in place 
FR L3 of E(y) (=v). 
FR Ch 
IM¢e 34 H Mc right subexpression. A_OPER LCRB Seg. ; 
ee Detachment method 
If can't prove X(x) by 
AW .L H  Sby detachment, ~b. Store 
-Ir 4. H Jap eb new problems in P. 
B NSGG LC : 
FM L5 Assures Sb everywhere. FEF T1 Cf 
Isb CL5 A To 1 B T must have C =~, 
BHN Ww 1 
FR 12 
Cc TV Cc H Sbx Ww L SV 
D -TF C H CSm L2 OD 
NSGG CL vet «=O SmCt Change view. 
FM C5 Assures Sb everywhere. CSm L2 OD 
TSbeebic > B FEN Tl A Find next T and repeat. 
BHN C BHB 




A OPER LCRB 


"y woQ ww =: 
is) 
4 


Q 


#SXM 


BANUEPEFEWWW AH EHH 


new WN KARP UAU ES 


RARWEW-H 
~~ aunt w 


4 
o 
) 


xy 


WHE 


NUPAIAAH AND 
NAY 


o 


au 


rFyiQh Ww 


Copy, to work on T. 
Mc 
P 

Create new X. 

Stored away fixed ME. 
MSb 


Store X(x) at (new) A(y) as 
main expression, 
Seg. 
Chaining Method 
If can't prove X(x) by 
chaining, ~b; Store new 
problems in P, 


T C(x) mst be >. 


T must have C =>, 


Copy, to work on T. 


McF 
SmB 


Find next T and repeat. 


Put E(2) and E(6) in 
proper wkg. memory. 


S Create EM for new X. 
Fix connective. 
Store parts. 


MSb 


Put C(x) = + (implies). 
Store X(x) in A(y) as XL(y). 
Store X(x) in A(y) as XR(y). 


A OPER LCRB Seg. 


> 


Replacement of — with v. 
If C(x) =, replace 


with v3; if not ~b, 
T 
Py Fix E(x). 

Fix EL(x), 




Primitives _ 
NAG x Add one to G(x). 
PCy x Put C(x) =v. 

A OPER LCRB Seg. 

LN peg iat oie ow es as ; 
FEF. Lil T 

A PUB 1 Erase old unit. 
~-TV al B 
PU 1 P 

BS 1 
FEN Lil A Find next E and repeat, 
BHN 

Primitives 
PU x Make E(x) a unit, (U). 
PUB x Make U(x) blank. 


A_OPER LCRB Seg, 

View as contracted 
Make units of binary 
expressions and 
isolated variables. 


< 
Q 
, 
Q 
La] 


vct 


Ne 


Recursion 


Recursion 


<j 
>) 
ra 
PennrPree ed 


Ct Blank V's of Ct unit. 


c Give X(x) a name if 
one needed, 


nn 
Perr nnreren 


Make left (isolated) 
variable a unit, 
XR(x) still to be done 


Pe 


Make right (isolated) 
variable a unit. 


Primitives 
ANGIatex Assign E(x) an unused name. 


(See VV for PU and PUB) 
TN x b If E(x) has a name +b, 
A OPER LCRB Seg. 


Re ement of v with > 
If C(x)=v and G(EL(x. 


>0, replace v with 73 


A_OPER LCRB Seg, 


FL 
FL 
-CX 
FR 
FR 
-CX 
BHB 


A -TV 
-CN 
BHB 


B BN 


Primitives 


ha a oll! 


PRM ean nae 


GO RRM at Ee arta 
FrwWwWnnkr| 


ea 


Mr nN 


fox! \E) 


wWra > 


OWrww 


Sb 


CX 


CN 


if not >b. 


Fix x. 


Find EM(x) and put in y. 
Add one to G(x). 


Subtract one from G(x). 


Put C(x) =>. 
If G{x) >0 9b. 


Com expressions 
Compare X(x) with X(y)3 if 
they match, ~b. 

G(L) = G(R), otherwise 7B. 


C(L) = C(R) 
Recursion down tree of 
expressions. 


L and C both variables; 
with identical names, 


(For CC, CGG, and CN, see LMc) 


A OPuR LCRB Seg. 


C2 
wo 
[=] 
tad 


WoOQONnhrRrREH 


Die 


Ew 


b 


Similar expressions test 
If DL(x) = DL(y) a) 


DR(x) = DR(y), %b. 


1h 


D 4 
-CD Le AeecD 
-CD Dr wen 
BHB 
A BHN 


A OPER LCRB Seg, 


EEG) =EG), a Ig), 


and H(x) =H (y) 4b. 


-CK Ct are Def: If K(x) = K(y) -b. 
-CJ LeCamrA Def: If J(x) = J(y) 7b. 
-CH Ly Gt oak Def: If H(x) = H(y) 7b. 
BHB 

A BHN 

A_OPat Lo RB Seg. 

Sa Deere eee Describe 


A OPER LC iB Seg. 


AND £58 <5 eee Count variable place 
FEF 1 ¢ 

A=CES9 1719) B 

-TU Al B 

NAH L 
B FENY Li A 
C BHN 
Primitives

   CPS   x y →b    If E(x) subelement of E(y), →b
                   (P(x) ⊃ P(y)).
   NAH   x         Add one to H(x).
   TU    x →b      If E(x) is a unit, →b.


A OPER LCRB Seg, 


NJ x Count distinct variables 
AA 1 List for counted-V. 
FEFa aLien ElSik Find first E of X(x), 
A =-CPS 2 jy 29) 
-TU 2 D 
FEF Is She AG Find first V of list, 
BUCN) POgLaD GCN 
BENG ess. Find next V of list, 
C SEN 2. al 
NAJ L A 
FEN LQ Find next E of X(x), 
E BHN 
Primitives

   AA    x         Assign an unused list to A(x).
   CN    x y →b    If N(x) = N(y), →b.
   CPS   x y →b    If E(x) subelement of E(y), →b
                   (P(x) ⊃ P(y)).
   NAJ   x         Add one to J(x).
   TU    x →b      If E(x) is a unit, →b.


A OPER LCRB Seg. 


NK___x Count Levels 
Tr thE KAS ED 
> ici 4B 
Fu (tt) NK 
KES 42 
FRO* (E2 
NK 2 
Ckoee2194Cs. Ck 
Pees GL KL 
A WAK L 
B BHN 
Cerrk. | 20 KR 
B A 
Primitives

   CKG   x y →b    If K(x) > K(y), →b.
   NAK   x         Add one to K(x).
   PK    x y       Put K(x) in K(y).
   TB    x →b      If E(x) is blank, →b.
   TU    x →b      If E(x) is a unit, →b.


PART 2: Reduction of procedural processes [*S] 


The Store instructions that rewrite
expressions in various ways can be reduced to
processes more like the rest of the primitive set.
The new primitives required are (a) two (PA and
CP) which belong to types of operations already
considered, and (b) four of a new type to
manipulate the P sequences. The latter operations
insert and delete subsequences from the front end
of a given sequence. Thus if P = LRRL and
P' = LRRLRLR, then P'' = P' - P = RLR and
P'' + P = LRRLRLR. Observe that subtraction can
only be performed when the subtrahend is an
initial segment of the minuend, and also that
addition is not commutative. All these routines
involve bringing in the elements, one by one,
modifying them and storing them in the new list.
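The arithmetic on P sequences in this paragraph can be checked with a small sketch. It assumes that a P sequence can be written as a string of the letters L and R, and it reads "P'' + P" as placing P in front of P'', which is the reading that reproduces the figures given in the text. The function names are hypothetical; the operations they stand for are HSPP and HAPP.

```python
def subtract_front(minuend, subtrahend):
    """P' - P: delete the initial segment P from the front of P'.
    Defined only when P really is an initial segment of P'."""
    if not minuend.startswith(subtrahend):
        raise ValueError("subtrahend is not an initial segment")
    return minuend[len(subtrahend):]


def add_front(prefix, sequence):
    """Add prefix to the front of sequence."""
    return prefix + sequence


P = "LRRL"
P1 = "LRRLRLR"                      # P' in the text

P2 = subtract_front(P1, P)          # P'' = P' - P
restored = add_front(P, P2)         # P'' + P
swapped = add_front(P2, P)          # operands exchanged

print(P2, restored, swapped)
```

Exchanging the operands gives a different sequence, which is the sense in which the addition is not commutative.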


Store a copy of X(x) at Store X(x) in A(y) in 
(new) A(y) (E(x) = M). nS of E(y) (E(y) =V) 
take E(x) from w.m.) 
A OPER LC RB 
A OPrR LCRB 


E 
Qawu 


w 
wa 
238 

NQKHORPRPE EH 

NNNNEK 




Store X(x) at (new) C FEN Ll A 
A(y) as main expression D BHN 


A OPER LCRB E PE L2 
cr SITTER NT” B B 
SXM_xy 
At ee 
FEF y 44-76 
A CPS 1L B 
PE 12 
MM C2 
HSPP L 2 
s 2 
B FEN Ll A 
C BHN 
Store X(x) in A(y) Store X(x) in A(y) 
as XL(y). as XR(y). 


A_ OPER LCRB A_ OPER LCRB 


a SEU —SiR_ xy 
FEF, . L722 ¢ FEF? 4Y2- 1:6 
A CPS 1L B cPS 1L B 
PE 12 PE 12 
PM C2 PM’ C62 
HSPP L 2 HSPP L 2 
HAPL 2 HAPR 2 
-HAPP C 2 HAPP C€ 2 
s 2 s 2 
B FIN “Ll & FEN Lil 
C BHN BHN 
Primitives

   AA    x         Assign an unused list to A(x).
   CP    x y →b    If P(x) = P(y), →b (locates "same"
                   element even though V, G, etc. have
                   been modified).
   CPS   x y →b    If E(x) subelement of E(y), →b
                   (P(x) ⊃ P(y)).
   HAPL  x         Add a Left to front of P(x).
   HAPR  x         Add a Right to front of P(x).
   HAPP  x y       Add P(x) to front of P(y).
   HSPP  x y       Subtract P(x) from front of P(y).
   PA    x y       Put A(x) in A(y).


Conclusion 


In this paper we have specified in detail an
information processing system that is able to
discover, using heuristic methods, proofs for
theorems in symbolic logic. We have confined
ourselves to description, and have not attempted
to generalize in abstract form about complex
information processing. Because of the nature of
the description, involving considerable rigor and
detail, it may be useful to set out in conclusion
the main features of LT, especially as these
appear to reflect basic characteristics of
complex systems.

First of all, LT can be specified at all
only because its structure is basically
hierarchical, and makes repeated use of both
iteration and recursion. So true is this, that one
of LT's main features, the use of a
problem-subproblem hierarchy, is hardly visible in
the program at all.

LT offers no guarantee of finding a proof; on
the other hand, it brings to its task a number of
different heuristic methods for achieving its
goals. All of these methods are important in
making LT sufficiently powerful to find proofs in
most cases, and to find them with a reasonable
amount of computation, but not all of them are
essential. Without chaining, for instance, LT
could still function. The methods MSb and MDt
still provide it with ways to prove theorems--and
even some theorems more easily provable by MCh
would yield to the more directly "brute force"
approach of the other two.

LT is still a very simple process compared,
for instance, with the array of methods,
techniques, and concepts used by a human logician.
For example, the concepts of commutativity and
associativity are nowhere to be found in LT. The
analysis of LT and its variations is a subject
for later papers. However, the following
facts, based on hand simulation, may help put LT
in perspective. LT will prove in sequence most
of the 60-odd theorems in Chapter 2 of Principia
Mathematica. With some extension in the variety
of methods and cues employed, it will prove most
of the theorems in Chapter 3, in which another
connective, "and," is introduced.16 We know
nothing, as yet, about what will be required for
an extension to the predicate calculus or to other
types of problem solving.


LT uses similarity-testing and matching as
a multi-stage search and selection process. The
questions of efficiency involved in such processes
have already been commented upon in Section II.
Additional variation and complexity enter the
program through the alternative modes, VV and VCt,
for perceiving the logic expressions in the course
of testing similarity and of matching.

In these and other ways, the logic theorist
is an instructive instance of a complex
information process. We expect to learn more about
such processes when we have realized the logic
theorist in a computer and studied its operations
empirically; and when the logic theorist will
have been joined by similar systems capable of
performing other complex information processing
tasks.


References


1. Bowden, B. V. (ed.), Faster Than Thought
(London: Pitman, 1953), pp. 161-198.


2. Hilbert, D., and W. Ackermann, Principles 
of Mathematical Logic (New York: Chelsea, 1950), 
Chapter 1. 


3. Whitehead, A. N., and Bertrand Russell, 
Principia Mathematica, vol. 1, 2nd ed. (Cambridge: 
Cambridge U. Press, 1925). 


16 A program to do this has been developed and 
hand simulated by Mr. Kalman Cohen. 


TESTS ON A CELL ASSEMBLY THEORY OF THE ACTION OF THE BRAIN, 
USING A LARGE DIGITAL COMPUTER 


N. Rochester, J.H. Holland, L.H. Haibt, W.L. Duda 
IBM Research Laboratory 
Poughkeepsie, N.Y. 


Abstract 


Theories by D.O. Hebb and P.M. Milner 
on how the brain works were tested by simulat- 
ing neuron nets on the IBM Type 704 Electronic 
Calculator. The formation of cell assemblies 
from an unorganized net of neurons was demon- 
strated, as well as a plausible mechanism for 
short-term memory and the phenomena of growth 
and fractionation of cell assemblies. The cell 
assemblies do not yet act just as the theory re- 
quires, but changes in the theory and the simu- 
lation offer promise for further experimentation. 


Introduction 


The problem of how the brain works can be 
approached by investigating the elementary com- 
ponents, the neurons, and then seeing how larg- 
er and larger assemblies of these operate. Or 
it can be approached by observing the behavior 
of the entire organism and working back to de- 
termine what the components must be. The for- 
mer activity is called neurophysiology and the 
latter is called psychology. Before we can say 
that the problem is well in hand, these two ap- 
proaches must meet in the middle so that we 
have a single consistent picture that firmly con- 
nects psychology and neurophysiology. 


As the neurophysiologist considers more 
and more complicated structures of neurons he 
gets into problems that are less and less related 
to his normal way of thinking. Curiously, how- 
ever, some of these problems do not begin to 
resemble parts of psychology. What is happen- 
ing is that the neurophysiologist is beginning to 
think about information handling machines that 
are too complex to be understood without the 
specialized knowledge of other disciplines. 
These other disciplines are information theory, 
computer theory, and mathematics. People in 
these other fields need to augment the work of 
the neurophysiologists and psychologists before 
the brain can be properly understood. 


In the experimental study of the brain it is 
not yet possible to observe well the electrical 
interconnections among neurons. No one has 
yet been able to simultaneously record input 




and output signals of a single neuron in the 
brain. For this reason it has not yet been pos- 
sible to test certain theories about how the brain 
works by experimentation on animals. 


It is possible to measure the electrical 
characteristics of an isolated neuron in some 
circumstances. “’“ One can imagine an elabor- 
ate network of such neurons and conjecture on 
the behavior of the network. The analytical 
treatment of these networks has proved that one 
can construct any desired kind of logical ma- 
chine from elements that are probably much 
less powerful than neurons. ”» 


The analytical approach has not been very 
effective in actually describing the behavior of 
complicated networks of neurons. However, it 
has proved effective to simulate such networks 
and to draw conclusions from the behavior of 
the simulated network of neurons. 


Two sets of simulation experiments were
made and another is in progress. In the first
of these it was possible to simulate a network
of up to 69 neurons and a test was made of part
of the theory advanced by D.O. Hebb in his
monograph, The Organization of Behavior. 5
The second set tested an unpublished revision
by P.M. Milner of part of Hebb's theory with a
network of 512 neurons. The third set is to
test a further revision. In each case the
original neurophysiological theory had to be
interpreted in order to get something definite
enough to simulate, and these interpretations were
made by the present authors.


THE 69-NEURON DISCRETE PULSE 
SIMULATION 


In this paper the term "neuron" will
generally be used as an abbreviation for the term
"simulated neuron". Likewise the term "synapse"
will be used to stand for the term "simulated
synapse", in other words for the simulation of
the coupling mechanism that enables one neuron
to send signals to another. Where ambiguity
could arise, qualifying adjectives will be used.


The basic idea of the simulation can be seen 
by reference to Fig. 1. The large rectangle in 
Fig. 1 stands for all of the 2048-word high speed 
electrostatic memory of the Type 701 calculator. 
The memory was divided into 70 parts, one for 
each neuron and one for the program. In the 
area reserved for each simulated neuron were 
some numbers that might theoretically be 
measured on a corresponding living neuron. 
These numbers gave all of the information that 
was needed about each neuron. Specifically, the 
things that were known about each neuron, either 
from its location in memory or from the numbers 
stored there, were: 


1. Its number (name)
2. How long since it had fired
3. How tired it was from having been
fired excessively
4. For each of 10 output (efferent)
synapses:
4.1 The number of the (efferent)
neuron that it stimulated
4.2 The magnitude of the signal that
it sent to that (efferent) neuron
when this (afferent) neuron fired.
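
The per-neuron record listed above can be pictured as a small data structure. The following is only a sketch in a modern language; the class and field names are hypothetical, and nothing here reflects how the numbers were actually packed into the words of the Type 701 memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Synapse:
    # Item 4.1: number of the efferent neuron that this synapse stimulates.
    target: int
    # Item 4.2: magnitude of the signal sent when the afferent neuron fires.
    magnitude: int


@dataclass
class Neuron:
    # Item 1: the neuron's number (name).
    number: int
    # Item 2: how long (in time steps) since it last fired.
    steps_since_fired: int = 0
    # Item 3: how tired it is from having been fired excessively.
    fatigue: int = 0
    # Item 4: the 10 output (efferent) synapses.
    synapses: List[Synapse] = field(default_factory=list)


# Hypothetical example: neuron 3 with ten outgoing synapses of equal size.
example = Neuron(number=3, synapses=[Synapse(target=t, magnitude=100) for t in range(10)])
```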


Under control of the program, the calculator 
repeatedly scanned the 69 neurons and, by 
making calculations, caused these numbers to 
change as they would have changed if the net- 
work had actually been constructed. Therefore, 
after each pass over the data in memory, the 
data represented, in great detail, the state of 
each neuron and synapse in the network at the 
next instant of time. 


In this model, time was quantized into time 
steps. A neuron could fire at any time step, 
but not between. A time step corresponded ap- 
proximately to the interval between the firing of 
one neuron in a chain to the firing of the next. 
In the simulation, the average length of time
required for a single time step was about 5.3
seconds and this corresponded to perhaps 0.7
milliseconds in the brain. Therefore, the simulation
was slower by a factor of about 7600.
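
As a check on the arithmetic, using the two approximate figures just quoted (both are estimates, so the result is only good to about two significant figures):

```latex
\frac{5.3\ \text{s}}{0.7 \times 10^{-3}\ \text{s}} \approx 7.6 \times 10^{3}
```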


At any given time step a neuron was either 
fired or in some state of recovery from being 
fired. Various recovery curves were used and 
the one shown in Fig. 2 was typical.


During any given run on the calculator, the
neurons were interconnected in a particular net.
Each neuron was connected so as to stimulate 10
other neurons. Usually the net was designed by
the calculator. It would make a random choice
of the neuron to be stimulated by each of the 10
output synapses of each neuron. It would record
these choices on punched cards and retain them
for the rest of the run.
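
A minimal sketch of such a random wiring follows. It assumes a uniform choice among all neurons and permits duplicate targets and self-connections, since the text does not say whether these were excluded; the function name and the fixed seed (standing in for the punched-card record of the choices) are hypothetical.

```python
import random


def build_random_net(n_neurons=69, fan_out=10, seed=0):
    """Return {neuron number: list of the neurons its output synapses stimulate}."""
    rng = random.Random(seed)  # fixed seed, so the same net can be reused for a whole run
    return {
        source: [rng.randrange(n_neurons) for _ in range(fan_out)]
        for source in range(n_neurons)
    }


net = build_random_net()
```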


If a neuron fired at time step (n-1) it would 
stimulate 10 neurons so as to tend to cause them 
to fire at time step n. The size of the signals 
sent to the 10 neurons would depend only upon 
the fact that the original neuron fired and upon 
the magnitudes of the interconnecting synapses. 
To say this another way, the input signals to a
neuron, together with its threshold, would deter- 
mine whether or not the neuron would fire, but 
if it did fire, the strength of firing would not de- 
pend on the input signals. 


The input situation of a typical neuron is 
shown in Fig. 3 with some possible values of 
synapse magnitudes. The behavior of neuron 
x is shown in the following table. 


Neurons that    Magnitude   Threshold   Would x
fired on                    of x        fire on
step (n-1)                              step n?

AB              295         256         Yes
BCE             252         256         No
BCDE            336         256         Yes
ABCDEFG         839         839         Yes
ABCDEFG         839         938         No
ACF             408         376         Yes
EFG             376         376         Yes


It can be seen that the input circuits to such a 
neuron can provide quite sophisticated switching. 
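
The rule behind the table can be sketched as follows: neuron x fires on step n if the summed magnitudes of the synapses from the neurons that fired on step (n-1) reach its current threshold. The individual magnitudes below are hypothetical. They were chosen only so that the sums agree with the table; the values actually shown in Fig. 3 may differ, because the table does not determine them uniquely.

```python
# Hypothetical synapse magnitudes onto neuron x, consistent with the table's sums.
MAGNITUDE = {"A": 211, "B": 84, "C": 84, "D": 84, "E": 84, "F": 113, "G": 179}


def total_input(fired):
    """Sum of synapse magnitudes from the neurons that fired on step (n-1)."""
    return sum(MAGNITUDE[name] for name in fired)


def would_fire(fired, threshold):
    """Neuron x fires on step n when the total input reaches its threshold."""
    return total_input(fired) >= threshold


for fired, threshold in [("AB", 256), ("BCE", 256), ("BCDE", 256),
                         ("ABCDEFG", 839), ("ABCDEFG", 938),
                         ("ACF", 376), ("EFG", 376)]:
    print(fired, total_input(fired), threshold, would_fire(fired, threshold))
```

The threshold is an argument rather than a constant because, in the model, it varies with the neuron's state of recovery.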


Not all of the properties of the simulated 
neurons have been described. However, to make 
the exposition easier to follow, it is convenient 
to skip ahead and show some observations on the 
behavior of networks of neurons. Except for
some minor difficulties, this behavior would be
obtained with neurons like those already described.
The discussion of these minor difficulties
will be clearer after showing these results.


Fig. 4 shows an example of what will be 
called diffuse reverberation. Each row in this 
figure indicates with a 1 those neurons that
fired and with a 0 those neurons that did not
fire in a particular time step. Each column, of
the 64 columns at the right, shows the history of
a single neuron. The right hand 64 columns of
Fig. 4 show, therefore, the complete firing history
of 64 neurons for 50 time steps.


Fig. 5 shows, as a function of time, the 
number of neurons that were simultaneously 
fired. The time covered here is a little larger 


than in Fig. 4 and shows the complete history 
beginning with a quiescent net and continuing 
until the activity died out. 


We propose this diffuse reverberation as a 
plausible mechanism for short term memory, 
the kind of memory that is involved in remem- 
bering the intermediate results in mental arith- 
metic. We will discuss later some conjectures 
as to how the brain can make use of such a memory
mechanism.


Now another property of the neurons will be 
described. When neuron A participated in firing 
neuron B, the synapse that enabled A to stimulate
B was increased in magnitude unless it already
had reached the limit of 938, in which case
it remained constant. This characteristic was
our version of Hebb's basic neurophysiological
postulate. Hebb postulated that, "When an axon
of cell A is near enough to excite a cell B and
repeatedly or persistently takes part in firing
it, some growth process or metabolic change
takes place in one or both cells such that A's
efficiency, as one of the cells firing B, is
increased."


This property of simulated neurons is somewhat
curious. No process of just this sort has
been observed in living tissue. However, it has
not been possible to demonstrate, by measurement,
that the Hebb postulate is false. Nothing
else has been observed that could account for
learning and memory in a plausible way. The
Hebb postulate suggests a plausible machine 
that does not contradict experiment. 


The purpose of the assumption about the 
growth of synapses is to get a mechanism for 
the retention of long term memory. When an 
animal experiences some event there will be 
activity in its brain. This activity will consist 
of a spatial and temporal pattern of firing of
neurons. During the experience, the synapses 
involved will be strengthened, according to 
Hebb's postulate. Therefore, the same, or a 
similar, sequence of neural events is more 
likely to take place later than it would have been 
if the animal had not had that experience. A
repetition of some part of the neural events that 
were associated with an experience is assumed 
to be the act of recalling the experience. It is 
evident that the mechanism that Hebb postulated 
would tend to cause recollections. The question
of whether or not the postulate is sufficient
is, in a sense, the main topic of this paper.


If no additional rule were made, the Hebb 
postulate would cause synapse values to rise 
without bound. Therefore, an additional rule 




was established: The sum of the synapse values
should remain constant. This meant that, if a
synapse was used by one neuron to help cause
another to fire, the synapse would grow. On the
other hand, if a synapse was not used effectively,
it would degenerate and become even less
effective, because active synapses would grow
and then, to obey the rule about a constant sum
of magnitudes, all synapses would be reduced
slightly, so the inactive synapses would decrease.
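
The combined effect of the growth rule and the constant-sum rule can be sketched as below. The limit of 938 comes from the text. The size of the increment and the use of proportional scaling to restore the sum are assumptions made only for illustration, as is the function name; the text says only that all synapses were reduced slightly.

```python
LIMIT = 938  # upper limit on a synapse magnitude, as stated in the text


def hebb_update(magnitudes, used, increment=8.0):
    """Grow the synapses that took part in a firing, then restore the original sum.

    magnitudes: list of synapse magnitudes
    used: set of indices of synapses that just helped fire a neuron
    increment: hypothetical growth per use (not given in the text)
    """
    original_total = sum(magnitudes)
    grown = [min(m + increment, LIMIT) if i in used else m
             for i, m in enumerate(magnitudes)]
    grown_total = sum(grown)
    # Assumed normalization: scale every synapse by the same factor.
    scale = original_total / grown_total if grown_total else 1.0
    return [m * scale for m in grown]


updated = hebb_update([100.0, 100.0, 100.0, 100.0], used={0})
```

In this sketch the used synapse ends above its starting value and the unused ones end slightly below it, which is the degeneration of inactive synapses described in the text.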


Before discussing network action further, 
another property of the neurons will be mentioned.
A neuron fired at too high a frequency
becomes less sensitive, so that more stimulation
is required to fire it. The effect of this is
shown in Fig. 6, which shows the threshold as a 
function of time when the neuron is fired repeat- 
edly with a constant level of stimulation. As 
with a living neuron, this simulated neuron fires 
rapidly at first and then settles down to a lower 
rate of firing. 


This process is called fatigue because of
the obvious analogy to living neurons. A significant
aspect of fatigue is that it is a form of memory
and, as such, may play an important part in
the operation of the brain.


The concept of cell assembly occupies a 
key position in Hebb's theory. A cell assembly 
is a group of neurons that are interconnected in 
a very complex fashion and within which diffuse 
reverberation can take place. Fig. 4 shows 
just such a situation. 


Parts of the cortex are imagined to consist 
of a large number of cell assemblies, each of 
which contains a large number of neurons. Only
a small fraction of the cell assemblies are 
aroused at any one time. In other words signals 
are reverberating in only a few cell assemblies 
at once. Just which cell assemblies would be 
aroused at any one time would depend in large 
part upon what cell assemblies had been aroused 
at a previous instant of time, and in small part 
upon signals from elsewhere. 


In the language of information theory, this 
part of the brain can be considered to be a finite 
state transducer, in which the internal state is 
determined by noting which cell assemblies are 
aroused and which are quiescent. In other words, 
the brain should exhibit a kaleidoscopic sequence 
of patterns of cell assembly arousal. It is out- 
side the scope of this paper to expound Hebb's 
theory, so it will be assumed henceforth that the 
reader either understands the significance of a 
finite state transducer or has read Hebb's book. 


In passing, it is worthwhile to point out how
appropriate the finite state transducer description
is for Craik's "hypothesis on the nature of
thought."


Hebb's theory required that it be possible 
for a neuron to belong to several different cell 
assemblies and that not all of these assemblies 
be aroused at once. Hebb's theory also required 
that it be possible for a neuron to change its 
affiliation from one cell assembly to another. It 
may be possible to devise a theory that has only 
the second requirement, but no further consid- 
eration of this possibility will take place in this 
paper. 


The problem of how cell assemblies can 
arise and how they become modified, is vital to 
this theory. It will be shown that Hebb's scheme 
is unlikely to work with neurons of the type des- 
cribed so far. It will also be shown that, by 
suitably improving the neurons and by making 
the network more complex, cell assemblies 
can be made to form spontaneously. It will fur- 
ther be shown that these cell assemblies are not 
entirely satisfactory but that there is a plausible 
course for further investigation. 


Suppose that there is initially some activity 
in a network of neurons and that input signals are 
impinging on the network. Suppose also that from 
time to time a particular input signal, S, arrives. 
When S first arrives, it will impinge on some internal
state, I_j. In other words it will impinge
upon some particular configuration of states of
individual neurons. The particular sequence of
internal states, I_{j+1}, I_{j+2}, I_{j+3}, ..., that follows
will strengthen synapses in such a
way that the sequence I_j, I_{j+1}, I_{j+2}, I_{j+3}, ...
is more likely to occur again. It was conjectured
that the next time S occurred, some part
of I_j would be in existence and that some part of
the sequence I_j, I_{j+1}, I_{j+2}, I_{j+3}, ... would be
reinforced. As S appeared repeatedly some
characteristic response to S gradually would become
sufficiently reinforced as to be identifiable.
As the characteristic sequence was arising there
would appear, in it, points where diffuse reverberation
could occur. In other words there
would be some internal state I_{j+k}, which would
repeat some part of an earlier state in the sequence.
As soon as this happened the rate of
reinforcement of the connections would increase, 
because each time the stimulus S occurred the 
sequence of states would be such as to give sev- 
eral reinforcements to some of the connections 
instead of just one reinforcement. It was con- 
jectured that cell assemblies corresponding to 
some common stimuli would arise in the brain 
in this way. 




In order to test this conjecture about the 
manner in which cell assemblies form, a pro- 
gram was written to generate an appropriate 
environment for the neuron network, and an 
arrangement was set up for the network to receive
signals from the environment. To receive
the signals, six neurons were chosen to act as
receptors. It was arranged that no neurons
would stimulate these receptors. Instead, they
could be fired only by an external program that
enabled the calculator to reach in and modify the
one bit on each of the six neurons that indicated
whether or not it had just received enough stimulation
to fire. The synapses from the receptors
spread out diffusely through the network.


The neuron net was stimulated once every 
ten time steps with a 6-bit signal that could de- 
fine the state of each of the six receptors. The 
signals were chosen by a program whose action 
is illustrated in Fig. 7. It is a Markov process 
in which there is some probability that the input 
will be random but mixed in with the random 
signals are frequent occurrences of certain se- 
quences. The network then had the opportunity 
to develop a characteristic response to each of 
the three sequences. 


The network did not develop any character- 
istic responses and there was no sign of devel- 
opment of cell assemblies. A number of vari- 
ations on this experiment were tried, all with 
the same result. Then the reason for the diffi- 
culty was realized and a simulation experiment 
was run to verify the explanation. 


In such a neuron network, the idea that a
detailed temporal-spatial pattern of firing can
be effectively reinforced by a partial repetition 
is false. The reason can be seen from the fol- 
lowing experiment. A simulation experiment 
was run to a convenient point where diffuse re- 
verberation was taking place. Then all the data 
was punched on tabulating cards. These cards 
contained all relevant information so that, if 
they were read by the calculator, the simula- 
tion would go on from where it left off. Before 
the cards were read by the calculator, however, 
they were reproduced to give four identical decks 
of cards. Then three of these decks were slight- 
ly modified, each in a different way. In each 
case the modification was to choose some neuron 
that was about to be fired and manually change 
the number that specified its state of recovery 
so that it wouldn't fire quite so soon. Then four
simulation experiments were run, one with each 
deck. 


The four sets of results were compared and 
it was found that the detailed patterns of firing 
diverged rapidly. In just ten time steps, in each 
case, over 30 per cent of the neurons firing were 
different. This result is shown in Fig. 8. This
shows that even slight differences rapidly grow 
to be large differences so there is little chance 
that a detailed pattern of firing can be effectively 
reinforced. 
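
One way to put a number on such a divergence is sketched below. The measure used, the share of neurons firing in either run that fired in only one of them, is an assumption; the text does not state exactly how the figure of 30 per cent was computed, and the function name is hypothetical.

```python
def fraction_different(fired_a, fired_b):
    """Share of the neurons firing in either run that fired in only one of them."""
    a, b = set(fired_a), set(fired_b)
    either = a | b
    if not either:
        return 0.0  # nothing fired in either run, so nothing differs
    return len(a ^ b) / len(either)


# Hypothetical firing sets for the same time step in two runs.
same = fraction_different({1, 2, 3}, {1, 2, 3})
partly = fraction_different({1, 2, 3, 4}, {1, 2, 3, 5})
```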


It was concluded from this work that some 
additional structure was needed within a network 
to allow cell assemblies to form. A plausible
model of a short term memory had been demon- 
strated but rather convincing evidence had been 
found to show that Hebb's postulate was not 
enough to make cell assemblies form. 


Some other experiments were run which 
coincided in time with the work of Farley and 
Clark and which reached essentially identical
results. However, these did not seem to throw 
any light on the central problem of how the brain 
works, so this line of investigation was dropped. 


512-Neuron F.M. Simulation


At this point we conferred with D.O. Hebb
and one of his people, P.M. Milner. Milner had 
been working on a revision of part of Hebb's 
theory to introduce more recent neurophysiolog- 
ical data. The essence of Milner's idea was that 
inhibitory synapses, as well as excitatory syn- 
apses, are needed and that within a cell assembly 
most synapses are excitatory, while between cell 
assemblies most synapses are inhibitory. This 
idea sounded to us like a plausible cure for the 
troubles in the first model. It made engineering 
sense. 


The significance of the idea can be seen by 
considering two cell assemblies. These will act 
like an Eccles-Jordan Flip Flop circuit. Suppose 
one is aroused. It keeps itself going by its internal
excitatory connections and keeps the other
quiescent by the inhibitory interconnections. 
Finally it begins to fatigue. As it begins to falter, 
it inhibits the other less strongly, so sporadic 
residual activity in the other begins to increase. 
This in turn inhibits the aroused cell assembly, 
causing it to falter more. This feedback con- 
dition causes an abrupt switching so that the 
aroused one becomes quiescent and the quiescent 
one becomes aroused. A more detailed discussion
of this can be found in Appendix 1.


It seemed certain that the switching action 
would take place, but it was not clear whether 
the possibility of having inhibitory synapses 




would be enough to allow cell assemblies to 
arise or whether some cell assembly structure 
would have to be built in at the start. 


Experiments with the discrete pulse model 
indicated that diffuse reverberation was a fairly 
reliable sort of thing in a net of 63 neurons, but 
quite erratic in a net with 21 neurons. Therefore
it was felt that in a new experiment there
should be a larger number of neurons in a net.
A major obstacle to this was that the calculator
was not fast enough to manage a very much
larger net, even though this was to be done on
the Type 704, which is faster than the 701. Something
had to be sacrificed.


It was decided to sacrifice the knowledge of 
exactly when an individual neuron fired. All
that the machine or the experimenter could 
know was the frequency at which a neuron was 
firing, and not the exact instants of time at which 
it did fire. The frequency would vary from time 
to time, so this was called the FM model. 


One particular version of the FM model will
be described here. There were 512 neurons,
each with 6 input (afferent) synapses and a number
of output (efferent) synapses that varied
from one neuron to another. The synapse magnitude
lay between -1 and +1 and changed as long
term learning took place. The frequency of a
neuron varied from 0 to 15. Equations are given
in Appendix 2 to specify precisely how these
quantities varied from time to time, and a qualitative
description is given below in the text.


The magnitude of a synapse was much like 
a correlation coefficient between the two neurons 
that it connected. If the frequencies of the two 
neurons usually went up and down together, the 
synapse magnitude would grow toward +1. If,
on the other hand, one neuron was usually inactive
while the other was active, the synapse
magnitude would approach -1. This is the FM
version of Hebb's basic neurophysiological 
postulate. 


The frequency of a neuron was obtained 
essentially by calculating, for each synapse, the 
product of the synapse magnitude and the fre- 
quency of the stimulating (afferent) neuron, add- 
ing these products, and normalizing. It was fur- 
ther bounded by not being allowed to go negative. 
Therefore a neuron could have a high frequency 
only if it was stimulated through positive synapses 
by neurons with large frequencies and not simul- 
taneously stimulated through negative synapses 
by neurons with large frequencies.
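
The qualitative rule just described can be sketched as follows. This is not the set of equations in Appendix 2: the normalization used here (division by the number of input synapses) is an assumption, the effect of fatigue is left out, and the function name is hypothetical.

```python
def next_frequency(weights, afferent_freqs, f_max=15.0):
    """Weighted sum of afferent frequencies, normalized and bounded to [0, f_max].

    weights: synapse magnitudes, each between -1 and +1
    afferent_freqs: frequencies of the stimulating (afferent) neurons, 0 to 15
    """
    drive = sum(w * f for w, f in zip(weights, afferent_freqs))
    normalized = drive / len(weights)  # assumed form of the normalization
    return max(0.0, min(f_max, normalized))  # frequency may not go negative


all_excited = next_frequency([1.0] * 6, [15.0] * 6)
cancelled = next_frequency([1.0, 1.0, 1.0, -1.0, -1.0, -1.0], [15.0] * 6)
```

The second call illustrates the last sentence of the paragraph: strong stimulation through negative synapses cancels strong stimulation through positive ones.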


The fatigue increased if the frequency was
high, stayed constant if the frequency was intermediate,
and decreased if the frequency was low.
Furthermore, it was not allowed beyond the
bounds of 0 and 7. A fatigue of 7 could nearly
stop a neuron while a fatigue of 3 did little to it.
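
A minimal sketch of this fatigue rule is given below. The bounds of 0 and 7 come from the text. The frequencies that count as "low" and "high" and the size of each change are hypothetical values chosen for illustration; the precise rule is the one given in Appendix 2.

```python
def next_fatigue(fatigue, frequency, low=4, high=10, step=1):
    """Raise fatigue at high frequency, lower it at low frequency, keep it in 0..7.

    low, high, step: hypothetical parameters, not taken from the paper.
    """
    if frequency >= high:
        fatigue += step
    elif frequency <= low:
        fatigue -= step
    # Intermediate frequencies leave the fatigue unchanged.
    return max(0, min(7, fatigue))


rising = next_fatigue(3, 15)
steady = next_fatigue(3, 7)
falling = next_fatigue(3, 0)
```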


An important change was made in the nature 
of the connections in the net. A distance bias was 
introduced so that two nearby neurons were more 
likely than two remote neurons to be connected
together through a synapse. In the experiment
described in this paper, the neurons were visu- 
alized as being arranged in a cylinder, as shown 
in Fig. 9. The cylinder was 16 neurons high and 
32 neurons around. If two neurons were within 
eight of each other, they were as likely to be 
connected by a synapse as any other two neurons 
that were within eight of each other. However,
no neurons that were farther apart were connect- 
ed by synapses. 
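
The distance bias can be sketched as follows. The cylinder dimensions, the reach of eight, and the six input synapses per neuron come from the text. How "within eight" was measured is not stated; this sketch assumes that both the vertical offset and the wrapped offset around the cylinder must be at most eight. The names are hypothetical.

```python
import random

HEIGHT, AROUND = 16, 32   # 16 neurons high, 32 around: 512 in all
REACH, FAN_IN = 8, 6      # connection range and input synapses per neuron


def within_reach(a, b):
    """True if distinct neurons a and b, given as (row, column), may be connected."""
    (row_a, col_a), (row_b, col_b) = a, b
    d_col = abs(col_a - col_b)
    d_col = min(d_col, AROUND - d_col)  # columns wrap around the cylinder
    return a != b and abs(row_a - row_b) <= REACH and d_col <= REACH


def choose_afferents(target, rng):
    """Pick FAN_IN distinct afferent neurons, equally likely, from those in reach."""
    candidates = [(r, c) for r in range(HEIGHT) for c in range(AROUND)
                  if within_reach(target, (r, c))]
    return rng.sample(candidates, FAN_IN)


afferents = choose_afferents((5, 5), random.Random(1))
```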


Four blocks of four neurons each were se- 
lected to act as receptors. These four blocks 
are shown in Fig. 9. The procedure that was 
used most of the time was that receptor areas 
1 and 4 were controlled to have maximum activity 
for three successive time steps and then the net 
was allowed to operate with no external stimula- 
tion for three time steps. Then areas 2 and 3 
were controlled to have maximum activity for 
three time steps and then the net was again left 
alone for three time steps. This cycle was re- 
peated many times. The cycle was considered 
to be the equivalent of about 0.2 seconds in an 
animal and took about 160 seconds on the calcu- 
lator. 
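
The stimulation cycle described above has twelve time steps and can be written as a simple schedule. In this sketch the steps are counted from zero and the function name is hypothetical. On the figures quoted, one cycle took about 160 seconds to simulate against 0.2 seconds in the animal, a slow-down of roughly 800.

```python
def stimulated_areas(step):
    """Receptor areas held at maximum activity on a given time step (0-based)."""
    phase = step % 12
    if phase < 3:
        return {1, 4}      # areas 1 and 4 driven for three steps
    if 6 <= phase < 9:
        return {2, 3}      # areas 2 and 3 driven for three steps
    return set()           # net left alone for the three steps after each burst


schedule = [stimulated_areas(s) for s in range(12)]
```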


Cell assemblies did actually build up around
each of the receptor areas. Within a cell assembly
the interconnections were largely excitatory
and between cell assemblies they were largely
inhibitory.


The activity of each neuron at each time
step for one complete cycle is given in Table 1.




Table 1


Activity During One Complete Cycle of Stimulation


1.


004.000000000060000001).0 314,100 300 
004.0000310010000007701000000000 
000000207000000000770100000120 00 
0.000000000005001000005000 3000000 
0010005 301100,00000610,000200000 
00000101000),00000005001100000000 
0000000000100004,0001000011000000 
00000010 000000000001100010000000 
000004.0000 30000 30000000000000001 
0 0000001000000h,10500000000),00110 
0.0000000000000600000004.100000000 
00000000000006000000000000000000 
000 00 33000000010000000001010000} 
00000000050010100 377000000010000 
0 0000000000000000077000 000001000 
000000 0005010000010 0000010000010 


2.


00,0 020 0000005000160000010000 300 
00 30200420020 00 20077060 300010000 
0000000050000020007 7507000000000 
000000003000002000500000010 00200 
000000h,1000000000000000000100000 
0000010 3100000 0000020 20100000000 
0004.0 000001000100000000001000000 
000000101000000000000000 00000000 
0000000000100001000700000000 0000 
000000011000001002000004.00000000 
000000000100004100410000000000 00 
0000000000000 310 01 300 00020000000 
000002 30000 000000000000000000000 
0000 0010000 310 0004,7 7000000 000000 
000102000000000004,77000000200000 
0 00001000000000 00000000 000000000 


3. 


0030011100000120 00670 07000000100 
0020200 3310 000020077010020000000 
00000000000010 300077507000000000 
00000001200000205 37 3000101 300000. 
00000000000 0000 000000000000 00000 
000 3000 30010000010 000 30100000000 
000300000000001000000 00000000000 
00 000010000000000000000000000000 
0000009C6000000001 307000000000000 
0000000.000010010010001106000000 
00030 00000000200004.0000010000000 
00000000000000000110000067000000 
00000020000000000000000107000000 
00000110 000500001177000000000000 


. 000001 30005001000777000000 300003 


0000010 00000000001 300 30000000000 


4.


002000100000105 0004 7007000000100 
00001002100000150025000000001000 
000000000 30020 3000 364.05 0010 00000 
00000000000001307556 301100410000 
00010000000000 000000000000000000 
000 30100001000001000050 000000000 
0000000 00010 00100020000000000000 
000000000000000000 000 0000 0000000 
00000000 000000000.0.000000000000 
00000000000010 0000100010060 0.0 00 
00030000000001000050000000000020 
00020000010000010000000577000000 
02001000000000000100 000007001000 
00000 310005501000012000d00000002 
0000006000700000070.7000 30030000 
000000000001000001310 30000000000 


5. 


000000002100 30520005005000000000 
00001000100000 3500210000010 3100 
000000000510002005 3500000 3000000 
00000000000000104 305402000 340010 
000 000000000000 00000010000000000 
0000000 0001000001000060000100000 
00000000000000000060000000000000 
00000000000000000010000000000000 
00002000 000000000120000000000000 
0.00000000000100000000010060 050 00 
0000010000000 000000000000000h,1 30. 
000000000 30000000000 300657000000 
~00101001000000000102000001000000 
00000 3000070 00000002003000000010 
1000006000 70000000030 00,0010000) 
00001000000 000000100000000000000 


6. 


000000005,0100250000 300000000000 
0001000000200101002000000000 3000 
00100000031011210525 000015000000 
00000000000000000100 30100000020 
30000000000 00 0000000610000000000 
000000000000000 30 00006001010 0000 
00000000000000100070100000000000 
000000 00000000000000000100000000 
0000 30000000000000 30000000 30 3000 
00000000000000000000 001000005001 
00000010000000001000000000015 30 
020011010 30000000000,04.5000010 00 
00000001000000000005000001003000 
01000000007001000000004.0000000 02 
0000 00011000000 00000005,00110200 
13002000000000000000000,00000000 


7.


000000002 30 3001.00 00000000000000 
000100000010 00100110000000000 000 
00100000002001210200000035000000 
00000000000070000000000000010010 
50 0000000000200 00000000000010000 
010000000010000 30000010000000000 
0000000 00000 0000007010 0000100010 
000000 0000007 7000000000077000000 
000000000000770000200 3007760 3000 
00011010000000000000000000000002 
001000000000000030000000000 3,000 
001000010000000000004.0500100 3000 
10000060000000000005l,00000000000 
01000000000001000000022000000000 
10.000000100000000000005015000000 
03000010 0000 000000000005000 00000 


8. 


000000000020000110 00000000000000 
0000 0000000001000200000000100000 
00000000000000006011000750060000 
00000000001000000000000010010000 
60 000000000020050010100000700000 
0000000000000 301 100000 0000000000 
0000000000C 000000600000000101010 
10000000000577050700000077000000 
000000000000771000004. 33077500010 
000000000C100000090000000000000 
004.0000 00009 0900 3906000 20004 2000 
0000000000701009uU00C141900000000 
00000 00000000070 6006.09000601000 
010000@000000002900004.0000600000 
000000000000000000000000 3410000 
00000000000002000000000100000000 


9.


00000000000000000000000000000000 
000000000000000060000000 30070000 
000000000000000070000 00700 3600.0 
000000000000000000000000 30005000 
000000000010100500101000007 30000 
0000000001006200100000000000000 
00000000000000011000070000000000 
00000 0000005 77050700000077010000 
00000010000077700000144.77000030 
00000000000000100001000000000000 
106000 00000000001000000000140007 
0000000000 704.070000024.0000000000 
00000000001100700010000000700050 
00000000000000050000000005600000 
00000000000400140000001050010000 
0000000000000 3050000000000001000 


10. 


00000000000000001000000000001006 
0 00000000000000060000060000 70000 
000000000000000070000007104500).0 
00000000 000000050000000000005000 
000 3000000 3000:36020000001.0750000 
00000000100065 300200000000000000 
4,00000000000000 31000070 020000000 
0000000000067 7060700001020000000 
0000006000000 370000041 31700500 30 
00 0000000000004,0000600 0000000100 
10 000000000000001000 700000000007 
00000000105 35070 000 0000000000000 
0 0000000001 3305000600000,0600061 
00000000000000 060 600020005501 300 
0000100000040 04. 7000020 00 30020000 
000000000 33004050000 00000001 3000 


11.


00000000000000004.100020,0010 3005 
0010000000 000000 7000006002060000 
000000200000000020 00020200510050 
0010000000002006000000000000500¢ 
0000000 3005002002 3000 30060000 
000000001 3037140001 3003000000000 
5000000000000005 3002070030020000 
.000000000001160102001000,0001N00 
000000 7000000 3400000102200050000 
000000000 00000.00006009000}4.0000 
20000000000100201201705001100005 
00000000100 3206000000000N0010100 
000000100033600000500010700001h.0 
0000000007000001,000002000.01), 300 
000000000004,005 6000000002001 3000 
00000000071,0020600000010000 30000 


12.


000000000000010 064.0004,060 3005006 
00100000000000001000005.0 3000000 
000000004.0000000100004.0000110010 
001000000000 3006000004.0000000000 
00000045004005400.041,0 3050160000 
00000000000510100104.005000010000 
1,00000001010000}4.0000000 30050100 
0000 0000200022010 0002001 30000000 
0000 004.0 00 300 001.000000020003002 
000000000000002000031002007].0000 
00000000000000),.014.02606 300100000 
00000000000 304000000 0000001 30032 
00000 31001 3120000030003040140004, 
000000000700 000 000000 3010001 3300 
00000000000000 31100010 0000010000 
0000000005210 3000000001000030001 






In Table 1 the numbers are just half of the 
frequencies of the neurons. This was done to 
reduce the cost of printing. Since it is difficult 
to see what is going on without practice and effort, 
small sections of the sixth and ninth steps have
been reproduced side by side in Fig. 10. These
times were chosen to contrast the arousal of 1
and 4, which suppress 2 and 3, with the arousal of
2 and 3, which suppress 1 and 4. Nearly every
neuron has chosen allegiance to one cell assembly 
or another. Only 3 of the 224 neurons shown are 
active at both times. 


Examination of the synapses also showed 
that cell assemblies had formed. A dividing
line was found between area 1 and area 2.
Synapses that crossed this line were predominantly
negative while synapses that failed to cross
it were predominantly positive.


There is no doubt that cell assemblies did 
form. A very detailed statistical study of the
allegiance of neurons to cell assemblies was not 
made because, as will soon be evident, the model 
still needs improvements, and the statistics of 
the improved model would be different. 


A further significant characteristic needed 
by the Hebb-Milner theory was evident. Over 
sequences of 100 or so time steps (perhaps the 
equivalent of 17 minutes) neurons were observed 
to change allegiance from one cell assembly to
another. In other words, "fractionation" and
"recruiting" were observed.


Some evidence was found to indicate that one 
cell assembly tended to arouse another. However, 
the tendency was weak. It can be seen that the 
only possible excitatory synapses between cell as- 
semblies would be those involved with neurons of 
dubious allegiance. Apparently this was insuf- 
ficient to allow spontaneous activity in the net. 
The only activity was caused by the input signals
and the arousal of cell assemblies was completely
determined by the input. The theory requires
that the preceding central activity (set) be much
more influential than the input stimuli, so clearly 
some changes are needed. 


Plans for the Future 


After studying the detailed results of this 
experiment, we arrived at a conjecture as to
what should be done next. This conjecture was
based on our intuition gained from experience in 
designing computing machines. We felt that the 
inhibitory synapses should be separated from the 
excitatory synapses and should follow different 
rules. Appendix 1 describes some detailed considerations
of the transmission of activity from
cell assembly to cell assembly. 


We then consulted again with P. M. Milner
and learned that he had just produced a further
revision of the theory that had just this property
of synapses with differing characteristics. His
new model appears also to have the characteristic
that the cell assemblies would be much more
diffuse than in the FM Model described here. This 
would correspond better to what is expected in 
the brain and would make a better machine be- 
cause one cell assembly could directly affect a 
larger number of others. It is not within the 
scope of this paper to discuss this new scheme 
because we have not yet reduced it to our term- 
inology and tested it. However, the work is pro- 
ceeding. 


Summary 


The first set of experiments, designed to 
test parts of the theory advanced in The Organi- 
zation of Behavior, by D.O. Hebb, simulated a 
network of 69 neurons with a "Discrete Pulse 
Model." This set of experiments clearly illustrated
the diffuse reverberation that is advanced
as an explanation of short term memory. There 
was, however, no tendency for neurons to group 
into cell assemblies. 


The second set of experiments was designed
to test P. M. Milner's revision of Hebb's theory
with an "FM Model" which kept track of the
frequency of firing of 512 neurons but ignored
the precise timing of individual firings. Cell assemblies
formed and exhibited the "fractionation"
and "recruiting" required by the theory. The cell
assemblies, however, were not able to arouse one 
another, so this model was too heavily dominated 
by environment. 


A third set of experiments is in progress. 
It is hoped that this set will get around the next 
major obstacle in producing a model that will do 
what the neurophysiological theory requires. 


This kind of investigation cannot prove how 
the brain works. It can, however, show that some 
models are unworkable and provide clues as to 
how to revise the models to make them work.
Brain theory has progressed to the point where it 
is not an elementary problem to determine whether 
a model is workable. Then, when a workable mod- 
el has been achieved, it may be that a definitive 
experiment can be devised to test whether or not 
the workable model corresponds to a detail of the 
brain. 




Appendix 1. 
The Interaction of Cell Assemblies 


Suppose that all synapses within a cell 
assembly are excitatory and that both excitatory 
and inhibitory synapses go between all assem- 
blies. Suppose also that the effect of stimula- 
tion at a synapse rises suddenly when the pre- 
ceding (afferent) neuron fires, and then dies out 
more slowly. For example, a model of this
could be a chemical transmitter that was discharged
on the stimulated (efferent) neuron and
that was destroyed at an exponential rate. Suppose
also that the effect of an excitatory synapse
fades more slowly than the effect of an inhibitory 
synapse. In terms of the chemical transmitter, 
this could mean that two different chemicals 
were used for inhibition and excitation, and that 
these were destroyed at different rates. Finally, 
suppose that the total inhibitory stimulation of an 
aroused cell assembly on a quiescent cell assem- 
bly dominates the total excitatory stimulation. 


While a cell assembly is firing actively it 
will suppress its neighbors. However, when its 
neurons tire and it begins to falter, the inhib- 
ition will drop more rapidly than the excitation. 
When the level of inhibition drops below the level 
of excitation, switching will take place. 
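
The switching argument above can be illustrated numerically. The sketch below is not the model as programmed; it assumes, purely for illustration, that once the aroused assembly falters both effects decay exponentially, and the starting levels and time constants are hypothetical values chosen so that inhibition starts higher but fades faster.

```python
import math

def switching_time(inhib0, excit0, tau_inhib, tau_excit):
    """Time at which decaying inhibition falls below decaying excitation.

    Each effect is assumed to decay as level0 * exp(-t / tau). All four
    parameters are illustrative assumptions, not values from the experiments.
    """
    if inhib0 <= excit0:
        return 0.0  # excitation already dominates, switching is immediate
    if tau_inhib >= tau_excit:
        raise ValueError("inhibition must fade faster than excitation")
    # Solve inhib0 * exp(-t / tau_inhib) = excit0 * exp(-t / tau_excit) for t.
    return math.log(inhib0 / excit0) / (1.0 / tau_inhib - 1.0 / tau_excit)

t_switch = switching_time(inhib0=10.0, excit0=6.0, tau_inhib=2.0, tau_excit=8.0)
print(f"switching after about {t_switch:.2f} time units")
```

With these assumed numbers the quiescent assembly is released a little more than one time unit after the aroused assembly begins to falter; a larger initial margin of inhibition delays the switch.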


This sort of interaction between neurons is 
being built into the third set of experiments. 


Appendix 2 
Equations Describing the FM Model 


The structure of the net is given by 


j = g(h, i) 
where i is the number of the efferent neuron, 
j is the number of the afferent neuron, and h 
is the number of the afferent synapse for the ith 
neuron. g(h,i) is determined at the beginning
of an experiment, and remains constant. 


The following quantities for all i, j determine
the state of the model at any time t.


Symbol            Number    Description
                  of Bits
$x(i,t)$          4         frequency of neuron i at time t
$\bar{x}(i,t)$    4         average frequency of neuron i at time t
$d(i,t)$          3         fatigue of neuron i at time t
$r(i,j,t)$        8         magnitude of the synapse at time t coupling
                            stimulation from neuron i to neuron j
$R(i,t)$          8         a function of $x(i,t)$ and $\bar{x}(i,t)$
                            (see formula 2)


Initial conditions for the net are given by the
values $x(i,0)$, $\bar{x}(i,0)$, $d(i,0)$, $r(i,j,0)$, and
$R(i,0)$.


The quantities S(i,j,t) and x'(i,t) are inter- 
mediate results in the calculation. A single 
time step consists of the successive evaluation 
of the following formulas: 


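1) $$S(i,j,t) = r(i,j,t)\,\sqrt{R(i,t)\,R(j,t)}$$

2) $$R(i,t+1) = \left(1 - \frac{1}{m}\right) R(i,t) + \left(x(i,t) - \bar{x}(i,t)\right)^2$$

   $$R(j,t+1) = \left(1 - \frac{1}{m}\right) R(j,t) + \left(x(j,t) - \bar{x}(j,t)\right)^2$$

   $m = 32$

3) $$S(i,j,t+1) = \left(1 - \frac{1}{m}\right) S(i,j,t) + \left(x(i,t) - \bar{x}(i,t)\right)\left(x(j,t) - \bar{x}(j,t)\right)$$

   $m = 32$

4) $$r(i,j,t+1) = \frac{S(i,j,t+1)}{\sqrt{R(i,t+1)\,R(j,t+1)}}$$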
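
Formulas 1 to 4 amount to keeping a running correlation coefficient between the deviations of the two neurons a synapse connects. The sketch below is one reading of those four steps for a single synapse. It uses floating-point arithmetic rather than the 8-bit quantities of the model, and the function name, argument names, and starting values are assumptions made for illustration.

```python
import math

M = 32  # the constant m of formulas 2 and 3

def update_synapse(r, R_i, R_j, dev_i, dev_j, m=M):
    """One pass through formulas 1-4 for a single synapse (i, j).

    r            -- synapse magnitude r(i,j,t)
    R_i, R_j     -- running squared deviations R(i,t), R(j,t); must be > 0
    dev_i, dev_j -- deviations x - x_bar of neurons i and j at time t
    Returns (r_next, R_i_next, R_j_next).
    """
    S = r * math.sqrt(R_i * R_j)                        # formula 1
    R_i_next = (1 - 1 / m) * R_i + dev_i ** 2           # formula 2
    R_j_next = (1 - 1 / m) * R_j + dev_j ** 2
    S_next = (1 - 1 / m) * S + dev_i * dev_j            # formula 3
    r_next = S_next / math.sqrt(R_i_next * R_j_next)    # formula 4
    return r_next, R_i_next, R_j_next

def run(dev_i, dev_j, steps=200):
    """Apply the update repeatedly with constant deviations (illustrative only)."""
    r, R_i, R_j = 0.0, 1.0, 1.0
    for _ in range(steps):
        r, R_i, R_j = update_synapse(r, R_i, R_j, dev_i, dev_j)
    return r

r_together = run(1.0, 1.0)    # neurons deviate in the same direction
r_opposed = run(1.0, -1.0)    # neurons deviate in opposite directions
print(round(r_together, 3), round(r_opposed, 3))
```

Under these assumptions a synapse between neurons that rise and fall together drifts toward +1 (excitatory), and one between neurons that move oppositely drifts toward -1 (inhibitory).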


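5) We define $p(i,t)$ as the set of $j$ such that $r(i,j,t) \ge 0$,
   and $q(i,t)$ as the set of $j$ such that $r(i,j,t) < 0$,
   including only values of $j$ such that the
   synapse $(i,j)$ exists.


   Then

   $$x'(i,t+1) = \frac{\sum_{j \in p(i,t+1)} \left[r(i,j,t+1) + k_1\right] x(j,t)}{\sum_{j \in p(i,t+1)} \left[r(i,j,t+1) + k_1\right]} - \frac{\sum_{j \in q(i,t+1)} \left[r(i,j,t+1) - k_2\right] x(j,t)}{\sum_{j \in q(i,t+1)} \left[r(i,j,t+1) - k_2\right]}$$

   (The values of the constants $k_1$ and $k_2$ are not legible in the source.)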


6) $d(i,t+1)$ is given by a table as a function of $d(i,t)$ and a second
   argument. (The formula and the table entries are not legible in the source.)


7) $$x(i,t+1) = \begin{cases} \text{an externally controlled value}, & \text{when neuron } i \text{ is a stimulated receptor} \\ X\left(x'(i,t+1),\, d(i,t+1)\right), & \text{when neuron } i \text{ is not a stimulated receptor} \end{cases}$$


Table of $x(i,t) = X(x'(i,t), d(i,t))$

[Table entries not legible in the source.]


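8) $$\bar{x}(i,t+1) = \left(1 - \frac{1}{n}\right)\bar{x}(i,t) + \frac{1}{n}\,x(i,t+1)$$

   (The value of the averaging constant, written here as $n$, is not legible in the source.)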


References 


1. Brink, F., Jr., "Excitation and Conduction
   in the Neuron" and "Synaptic Mechanisms,"
   pp. 50-120 in Handbook of Experimental
   Psychology, Ed. by S. S. Stevens, John
   Wiley and Sons, Inc., New York; 1951.


2. Eccles, J. C., The Neurophysiological
   Basis of Mind, Oxford: The Clarendon
   Press, 1953.


3. McCulloch, W. S., and Pitts, W., "A Logical
   Calculus of the Ideas Immanent in
   Nervous Activity," Bull. Math. Biophysics,
   vol. 5, pp. 115-133, 1943.


4. Kleene, S. C., "Representation of Events
   in Nerve Nets and Finite Automata," in
   Automata Studies, Annals of Mathematics
   Studies, No. 34, Ed. by C. E. Shannon
   and J. McCarthy. Princeton: Princeton
   University Press, 1956.


5. Hebb, D. O., The Organization of Behavior,
   New York: John Wiley and Sons, Inc., 1949.


6. Ref. 5, p. 62.


7. Ref. 2, and notice that Hebb's postulate
   (Ref. 6) is not necessarily related closely
   to Eccles' "post-tetanic potentiation." On
   p. 196 Eccles shows the effect of a million
   volleys (Fig. 6A, 36 minute curve) and this
   is much more severe than is relevant for
   the present discussion.


8. Craik, K. J. W., The Nature of Explanation,
   Cambridge: The University Press, 1952.


9. Farley, B. G., and Clark, W. A., Proceedings
   of the Western Joint Computer
   Conference, 1955.


Fig. 1 - Allocation of memory.


Fig. 2 - Threshold curve (threshold versus time steps after firing).


Fig. 3 - Example of a simulated neuron and its connections
(threshold ranges from 256 to 938; 7 input synapses; 10 output synapses).


Fig. 4 - Firing pattern of 69 neurons for 50 time steps showing diffuse reverberation.


Fig. 5 - Number of neurons firing at each time step showing
diffuse reverberation.


Fig. 6 - Threshold as a function of time.


Fig. 7 - Environment (flow chart: sequences 1, 2, and 3 of 8 signals each;
choose a random signal; decide what to do next using a random number).


                 Total Neurons Firing                   Different Neurons Firing

Time     Cntl. Run   N40 Sup.   N61 Sup.   N70 Sup.
Step        (C)        (A)        (B)        (D)         A-C     B-C     D-C
151         31         30         30         30           1       1       1
152         30         30         28         28           0       2       2
153         28         29         27         27           1       5       5
154         30         31         29         31           1       2       5
155         31         31         28         30           2      19       9
156         33         34         26         30           3       ?      17
157         30         29         31         31           5      37      19
158         31         26         32         29           9      37      16
159         32         30         32         34          14      32      22
160         34         32         33         38          20       ?       ?

(? = entry not legible in the source.)


Three separate runs are represented in this chart in addition to the
control run, C. In run A, neuron 40 was suppressed; in B, N61 was suppressed;
and in D, N70 was suppressed.


Fig. 8 - Divergence after suppressing one firing of one neuron. 


Fig. 9 - Arrangement of neurons in FM model.


Fig. 10 - Illustration of cell assemblies.


THE MEASUREMENT OF THIRD ORDER PROBABILITY DISTRIBUTIONS 
OF TELEVISION SIGNALS 


W. F. Schreiber


Research Department, Technicolor Corporation 
Burbank, California 


Summary 


A device has been built for the rapid, 
automatic measurement of the third order proba- 
bility density of video signals. Special cathode 
ray tubes are used to perform a 64-level ampli- 
tude analysis on each signal. A triple coinci- 
dence is formed among the three analyzer outputs, 
the number of occurrences in a frame interval 
being stored in an 8 Mc counter. These numbers 
are recorded on magnetic tape which then becomes 
the input to an electronic computer. The com- 
puter calculates the conditional entropy, i.e., 
the information generated by one of the signals 
when the other two are known. Examples are pre- 
sented of second and third order distributions, 
and of entropies calculated for a variety of 
scenes. 


Introduction

One standard television signal occupies 
four times the bandwidth allotted to the entire 
A-M broadcast service. This is undesirable for 
many reasons. In the first place, spectrum space
is in short supply, with many communication facilities
competing for frequency assignments. Secondly,
there are some applications of television
which are made considerably more difficult on 
account of the wide bandwidth, for example, re- 
cording. Finally, there are other television 
applications which are made quite impossible, for 
example, long distance wireless transmission for 
both military and civilian use. It seems worth- 
while to investigate methods by which the band- 
width might be reduced. 

There are two categories of techniques 
which might be called upon to reduce the band- 
width of television transmissions. The first 
depends upon the psychology of vision. Such 
techniques involve removing from the television 
signal certain information not required for a 
satisfactory image to be reproduced, and to do 
this in some way which results in a bandwidth 
compression. Vertical interlace as used in com- 
mercial television transmissions is an example 
of this kind of bandwidth reduction. It results 
in a full definition picture with half the band- 
width otherwise necessary. Other schemes of this 
type have also been proposed, such as Toulon's 
suggestion for "knight's move" scanning.1 The
second category of bandwidth reduction techniques 
depends upon information theory; that is, these 
techniques attempt to exploit statistics of the 
television signal so as to remove redundancy and 
to code the signal in such a way that the trans- 
mission bandwidth is reduced while at the same 
time the picture presented to the observer is
essentially unchanged. Again there have been
many suggested schemes but as yet none has been
instrumented. Now just as techniques based on
the psychology of vision must depend on carefully
made psychophysical experiments, methods depend- 
ing on the statistics of the signal must have as 
their basis a detailed knowledge of the statistics 
of television signals. This paper describes a 
machine for rapidly and accurately measuring those 
statistical parameters of television signals which 
are useful, first for the estimation of their sta- 
tistical information content, and second as an aid 
in the design of coding systems. Some results of 
these investigations are given. 


Choice of Statistic to be Measured


There are about 200,000 picture elements 
in a standard television frame, and if 32 brightness
levels are allowed, then $32^{200,000}$ different
pictures are possible. This number at once points 
out the unnecessarily high capacity of the exist- 
ing television system and the effort that would be 
required to determine the complete statistical 
description of television signals. Such a de- 
scription would entail the measurement of the 
probability of occurrence of each possible pic- 
ture, individually and in sequence. Clearly a 
less complete statistical description, neverthe- 
less useful, must be found. The choice is wide, 
and a selection is dictated by the use to which 
the data will be put. 
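
The size of the number quoted above is easier to grasp in logarithmic terms. The short calculation below uses only the two figures given in the text (200,000 elements, 32 levels); the variable names are illustrative.

```python
import math

ELEMENTS_PER_FRAME = 200_000   # picture elements per frame, from the text
LEVELS = 32                    # brightness levels, from the text

# Selecting one of LEVELS**ELEMENTS_PER_FRAME pictures takes this many bits.
bits_per_picture = ELEMENTS_PER_FRAME * math.log2(LEVELS)

# Number of decimal digits needed to write out LEVELS**ELEMENTS_PER_FRAME.
decimal_digits = math.floor(ELEMENTS_PER_FRAME * math.log10(LEVELS)) + 1

print(f"{bits_per_picture:,.0f} bits per picture")
print(f"about {decimal_digits:,} decimal digits in the count of pictures")
```

A complete description would therefore need a probability for each of a set of pictures whose count runs to several hundred thousand digits, which is why a lower-order statistic is sought.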

Inspection reveals that the principal re- 
dundancy in television signals, and consequently 
the source of channel economies, lies in the rela- 
tion among the amplitudes (or brightness) of 
neighboring picture elements within a frame, and 
among corresponding elements in successive frames. 
Correlation of this type has already been exploited 
to reduce the average power required for picture 
transmission.2 This leads us to believe that the 
appropriate statistic to measure is the joint am- 
plitude probability distribution of these related 
picture elements. An nth order distribution per- 
mits the calculation of an nth order upper bound 
on the information content. This in turn sets a 
limit on the maximum bandwidth compression possi- 
ble using a coding system which utilizes just the 
statistical relation among the neighboring elements.
If, in fact, a significant statistical
relation existed among, say, five such elements, 
we would want to measure a fifth order distribu- 
tion. If we measured a lower order distribution, 
we would then calculate a looser (higher) upper 
bound on the information content, and would be 
more pessimistic than necessary about the pros- 
pects for bandwidth reduction. 
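
The way a lower-order measurement gives a looser bound can be seen on a toy signal. The sketch below is not the computer program used for the measurements; it assumes a one-dimensional sequence of quantized samples standing in for neighboring picture elements, and estimates the entropy of a sample given the preceding ones from counts of adjacent groups. The function name and the test sequence are hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(samples, order):
    """Entropy (bits) of a sample given the (order - 1) samples before it.

    Estimated from counts of adjacent groups of `order` samples;
    order=1 gives the ordinary first-order entropy.
    """
    groups = Counter(tuple(samples[k:k + order])
                     for k in range(len(samples) - order + 1))
    total = sum(groups.values())
    contexts = Counter()
    for group, count in groups.items():
        contexts[group[:-1]] += count       # counts of the preceding samples
    h = 0.0
    for group, count in groups.items():
        h -= (count / total) * math.log2(count / contexts[group[:-1]])
    return h

# Toy signal in which each sample is fixed by the two samples before it.
signal = [0, 0, 1, 1] * 50
h1 = conditional_entropy(signal, 1)
h2 = conditional_entropy(signal, 2)
h3 = conditional_entropy(signal, 3)
print(round(h1, 3), round(h2, 3), round(h3, 3))
```

For this signal the first- and second-order estimates are both close to one bit per sample, while the third-order estimate is zero: stopping at second order would badly overstate the information content.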

The fact is that we do not know how many
nearby elements are statistically related. Therefore
we shall measure as high an order distribution
as present techniques permit, and compare
the results with lower order measurements. That 
is the principal reason why we have chosen to 
measure third order distributions. However, 
there is an additional reason which may be more 
compelling. This is that it appears to us that a 
practical coding system (at least the first prac- 
tical coding system) ought to use the same code 
for all pictures. To do otherwise would require 
a great deal of additional equipment. Therefore 
the statistic worked on ought to be one which is 
reasonably constant for the range of pictures encountered
in television practice. There is a
plausible argument that the third order distribu- 
tion is the highest such. The argument is this: 
For a picture of 64 allowable brightness levels, 
a third order distribution is a tabulation of the 
probabilities of occurrence of each of $64^3$ combinations
of brightnesses. This is about one quarter
million combinations. Since there are only about 
200,000 elements per frame, each of these combina- 
tions will occur on the average about once per 
frame. Due to the nature of pictures, most of 
these combinations will never occur, and some, 
for example, those indicating three equal bright- 
nesses, will occur rather frequently. However, 
there will be some combinations which occur in 
typical pictures once or twice. Obviously, a 
reliable estimate of these probabilities is not 
obtained. The measurement of fourth or higher 
order distributions would result in a much larger 
proportion of such combinations in which the sta- 
tistical evidence was very unreliable. 
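
The arithmetic behind this argument can be laid out for several orders at once. The figures (64 levels, about 200,000 elements per frame) are those given in the text; the loop itself is only an illustration.

```python
LEVELS = 64
ELEMENTS_PER_FRAME = 200_000

# For each order, the number of brightness combinations and the average
# number of occurrences of each combination in a single frame.
occupancy = {}
for order in (1, 2, 3, 4):
    combinations = LEVELS ** order
    occupancy[order] = ELEMENTS_PER_FRAME / combinations
    print(f"order {order}: {combinations:>10,} combinations, "
          f"{occupancy[order]:.4f} occurrences each per frame on average")
```

At third order the average is a little under one occurrence per combination per frame; at fourth order it falls to roughly one in a hundred, so nearly every cell of the tabulation would rest on no evidence at all.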

Finally the third order distribution will 
actually give us a great deal of information. It 
will, for example, settle once and for all the 
question of whether the generation of picture sig- 
nals is a second or third order statistical pro- 
cess. For example, if the only statistical influ- 
ence which exists in a picture is between the 
brightness in adjacent elementary areas, then the 
third order approximation to the entropy will be 
equal to the second order approximation and this 
is a point which can be resolved by this measure- 
ment. 
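
The equality claimed here follows in one line from the definition of conditional entropy. In the notation below, which is introduced only for this note, $X_1, X_2, X_3$ are the brightnesses of three successive elements and $p$ denotes their joint and conditional probabilities.

```latex
% If only adjacent elements influence one another, then
%   p(x_3 \mid x_1, x_2) = p(x_3 \mid x_2),
% and the third order conditional entropy reduces to the second order one:
\begin{aligned}
H(X_3 \mid X_1, X_2)
  &= -\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3)\,\log_2 p(x_3 \mid x_1, x_2) \\
  &= -\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3)\,\log_2 p(x_3 \mid x_2) \\
  &= -\sum_{x_2, x_3} p(x_2, x_3)\,\log_2 p(x_3 \mid x_2)
   \;=\; H(X_3 \mid X_2).
\end{aligned}
% In general H(X_3 \mid X_1, X_2) \le H(X_3 \mid X_2), with equality
% exactly when X_3 is independent of X_1 given X_2.
```

A measured third order entropy noticeably below the second order value would therefore indicate influence extending beyond adjacent elements.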


Technique of Measurement 


The amplitude probability distribution of 
a stationary time series (i.e., a sequence of 
numbers or pulses) is measured by means of counting
the number of occurrences of each possible
value of the variable for a long time and normal- 
izing by dividing by the total number of pulses 
or numbers. For a periodic time series, it is 
necessary to count only for one period. The dis- 
tribution of a continuous variable may be mea- 
sured in two ways. The first way is to measure 
the proportion of time during which the signal is 
found within each small amplitude interval, for 
example by blackening a photographic emulsion in
proportion to these times,3,4 or by averaging
a current which flows only when the signal is in 
the interval. This is the method used previously 
by the author in measuring second order distributions.5
The advantage of the measurement by proportions,
which may be called a continuous measurement,
is that the equipment can be relatively
simple. In addition, it is generally possible to 
make a simultaneous measurement of the probability 
for all of the separate amplitude intervals, and 
thus obtain the entire distribution very quickly. 
On the other hand, this method is also character- 
ized by a limited dynamic range. The ratio of 
maximum to minimum probability which can be mea- 
sured reliably is ordinarily limited to about 
100:1 by stray light and/or the characteristics 
of photographic materials. 

A second way to measure the distribution 
of a continuous variable, assuming it is band- 
limited, is to create a time series from it by 
sampling at a rate high enough (twice the highest 
frequency component) to include all the signifi- 
cant fluctuations, and then to measure the dis- 
tribution of the time series by a counting tech- 
nique. The advantage of this digital method is 
that, unlike the continuous technique, the dynamic 
range of the measurement is unlimited by uninten- 
tional factors, and is set solely by the number of 
events (picture elements) in a period. On the 
other hand, the circuits required to produce and 
process the sampled video signals are complex, and 
the counter, which for television signals must 
operate at a maximum rate of 8 Mc/sec., is an ad- 
ditional complication not present in the continu- 
ous case. Furthermore, unless there is provided a 
whole set of expensive counters, it is necessary 
to use one period to measure the count in each 
amplitude interval, or, in our case, for each of 
the $64^3$ combinations of amplitude intervals.
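
The counting method and its dynamic range can be sketched in a few lines. This is an illustration only, assuming samples already quantized to integer levels 0 to 63; the function name and the artificial data are hypothetical.

```python
from collections import Counter

def counted_distribution(samples, levels=64):
    """Estimate the amplitude distribution of quantized samples by counting
    occurrences of each level and dividing by the total number of samples."""
    counts = Counter(samples)
    total = len(samples)
    return [counts[level] / total for level in range(levels)]

# Artificial period of 200,000 samples: one level very common, one seen once.
samples = [10] * 199_999 + [42]
distribution = counted_distribution(samples)

smallest = min(p for p in distribution if p > 0)
dynamic_range = max(distribution) / smallest
print(f"largest-to-smallest measurable probability: {dynamic_range:,.0f} to 1")
```

With counting, a single occurrence in a period is still registered, so the measurable range of probabilities is limited only by the number of events in the period, in contrast with the roughly 100:1 range of the continuous method.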

Despite the added complexity of the digital 
measuring method, it was selected on account of 
its higher accuracy. There seemed to be no advan- 
tage in using multiple counters to speed up the 
recording of data, as only about 2 1/2 hours is 
required to record an entire distribution. Each 
count is recorded on magnetic tape for later use 
directly as the input to an IBM 704 computer for 
calculation of the entropy. 
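
The recording time quoted above can be checked from the number of frames needed, one frame per combination of amplitude intervals. The frame rate is not stated in the text; 30 frames per second is assumed here as the nominal rate.

```python
LEVELS = 64
FRAMES_PER_SECOND = 30   # assumed nominal frame rate; not given in the text

frames_needed = LEVELS ** 3                    # one frame per combination
recording_hours = frames_needed / FRAMES_PER_SECOND / 3600
print(f"{frames_needed:,} frames, about {recording_hours:.2f} hours")
```

Under that assumption the result is a little under two and a half hours, in agreement with the figure given.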


Description of the Apparatus


The probability machine consists of two 
basic parts. The first is a flying spot scanner 
which generates stable, high quality video signals 
from slide transparencies. The other part con- 
sists of the circuits necessary to measure and 
record the probability densities of the video signals
produced by the scanner.

The key operations in the measurement of 
the probability distribution are sampling the sig- 
nals at 8 Mc/sec, and then determining in which of 
the 64 amplitude intervals each sample belongs. 
Both of these operations are performed in a simple 
and elegant manner by a special cathode ray tube, 
called a switch tube, the design of which was 
evolved jointly with the tube manufacturer. Figure
1 is a drawing of the final model of the device.
Because of the key role played by the switch tube, 
it is worthwhile discussing its characteristics in 
some detail. 


The Switch Tube 


Many investigators have used cathode ray 
tubes for probability measurements. The signal 
is ordinarily applied to one set of deflection 
plates and a measurement is made of the brightness 
of the trace at each point along its length. If 
the phosphor brightness is actually proportional
to the average current density of the incident 
beam, then the brightness at each point is propor- 
tional to the length of time spent by the beam in 
the vicinity of each point, as desired. In gray 
wedge analysis, the brightness is measured photo- 
graphically. It is possible to eliminate defects 
in this measurement due to nonlinearity of the 
CRT deflection and due to non-uniformity of the 
phosphor by measuring the brightness photoelectri- 
cally and translating the trace past a fixed aper- 
ture, rather than vice versa. 

It appeared to be an improvement over this 
last method to use a special cathode ray tube 
which eliminated the steps of converting electri- 
city into light and back into electricity by per- 
mitting the direct measurement of the beam current 
through a physical aperture corresponding to the
aperture of the optical arrangement. This elim- 
inates any error due to phosphor saturation, light 
scattering by the phosphor, and stray light. How- 
ever, the dynamic range of the measurement is 
still limited by stray electrons due to reflection 
and secondary emission. It is this consideration 
which led us to the decision to count pulses rath- 
er than to measure currents or brightnesses. This 
is done by applying to the cathode of the switch 
tube an 8 Mc/sec sampling pulse train. Then when- 
ever the beam is directed towards the aperture, a 
train of current pulses is produced in the collec- 
tor electrode, and these may be counted. For de- 
termining the amplitude interval in which a signal 
belongs, the voltage applied to the deflection 
plates is made equal to the difference between 
the video signal and a fixed reference voltage. 
Since the aperture is centered at zero deflection 
voltage, an output will be obtained whenever the 
video signal is equal to the reference voltage, 
plus or minus the amount required to deflect the 
beam to the edge of the hole. By setting the hole 
size to correspond to one sixty-fourth of the 
peak-to-peak signal amplitude, each signal sample 
will in principle produce an output pulse for one 
and only one of the sixty-four possible values of 
the reference voltage. By having the reference 
voltage take on each of its possible values for 
one frame interval at a time, and by recording the 
number of pulses through the aperture in each 
frame, a complete first order distribution is re- 
corded in 64 frames. A second order distribution 
may be measured by applying a second video signal 
and reference voltage to the other set of deflec- 
tion plates of the switch tube. Under those con- 
ditions, an output will be obtained only when 
both video signals match their respective fixed 
voltages, the switch tube simultaneously perform- 
ing the functions of amplitude selection and coin- 
cidence. The second fixed voltage then is made to 
change one step each time the first fixed voltage 
completes an entire cycle of 64 steps. $64^2$, or
4096, frames are used for the second order measurement.
Similarly, $64^3$, or 262,144, frames are
required for the third order distribution. For 
that measurement, a second switch tube is needed, 
and a coincidence must be formed between the out- 
puts of the two tubes in advance of the counter. 
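
The stepping of the two reference voltages for the second order measurement can be mimicked in software. The sketch below is a simulation under stated assumptions, not a description of the circuits: the two video signals are taken as lists of samples already quantized to levels 0 to 63, the same frame is rescanned for every setting, and all names are hypothetical.

```python
def measure_second_order(signal_a, signal_b, levels=64):
    """Simulate the frame-by-frame second order measurement.

    counts[ref_b][ref_a] is the number of samples in one frame for which
    signal_a matches ref_a and signal_b matches ref_b at the same instant.
    """
    counts = [[0] * levels for _ in range(levels)]
    frames_used = 0
    for ref_b in range(levels):        # steps once per full cycle of ref_a
        for ref_a in range(levels):    # steps once per frame
            # One frame interval: the counter totals the coincidences.
            counts[ref_b][ref_a] = sum(
                1 for a, b in zip(signal_a, signal_b)
                if a == ref_a and b == ref_b
            )
            frames_used += 1
    return counts, frames_used

# Small artificial "frame" of 256 samples per signal.
signal_a = [(3 * k) % 64 for k in range(256)]
signal_b = [(k // 4) % 64 for k in range(256)]
counts, frames_used = measure_second_order(signal_a, signal_b)
print(frames_used, sum(map(sum, counts)))
```

Every sample is counted in exactly one of the 4096 frames, so the counts add up to the number of samples in a frame. A third order measurement would add one more enclosing loop over a third reference, giving $64^3$ frames.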

The ideal operation of the switch tube as 
described above depends on having an electron 
beam of perfect focus, i.e., zero diameter. The 
effect of a finite diameter beam is to produce 
collector current pulses of intermediate ampli- 
tude when only a portion of the beam enters the 
aperture. To minimize the proportion of pulses 
which are so affected, it is desirable to have as 
high a ratio as possible of aperture width to 
beam diameter. Since the aperture width corres- 
ponds to 1/64 of the video amplitude, we use as 
high a video voltage as convenient to generate 
and use a physical aperture of corresponding 
size. However, once one has set a limit to the 
deflection voltage, it is still possible to se- 
lect operating conditions for the switch tube to 
maximize the ratio. The figure of merit of the 
tube for this purpose is the deflection sensibil- 
ity, i.e., the voltage required to deflect the 
beam one beam diameter. This is so because, hav- 
ing set the amplitude of deflection voltage, we 
have also set the voltage required to deflect the 
beam one aperture width. Therefore the desired 
ratio is maximized by minimizing the voltage re- 
quired to deflect the beam one beam diameter. 

We have measured deflection sensibility 
and have found that it increases with decreasing 
acceleration voltage in the gun of the tube. 
Apparently, the deflection sensitivity increases 
faster than the beam diameter, as the accelera- 
tion is lowered. The disadvantages of low volt- 
age operation are that the available beam current 
is low and the beam is susceptible to hum field 
deflection, but these difficulties have been 
overcome. We operate with about 500 volts accel- 
eration, 10 microamps beam current, and have a 
close-fitting magnetic shield over the entire 
tube. With these operating conditions and a de- 
flection voltage of about 200 volts peak-to-peak, 
plate-to-plate, the beam diameter is about 1/5 
the aperture width of .25 inches. 

One final consideration in the application 
of the switch tube concerns the shape of the cur- 
rent pulse in the collector. If the video de- 
flection voltage is not changing during the in- 
terval in which the switch tube is pulsed on, 
then the current pulse has the same shape and 
duration as the 8 Mc/sec sampling pulse, which 
is about 1/16 µsec wide. If, however, the beam
moves past the aperture in less than 1/16 µsec,
then the current pulse shape and duration will be 
governed by the sweep speed, and, furthermore, 
pulses will be received by the collector for sev- 
eral successive values of reference voltage, 
since, in the same 1/8 µsec period, the video
voltage will be in several adjacent amplitude 
levels. Whether these pulses are counted depends 
on their width and amplitude, and is thus uncer- 
tain. To overcome these difficulties, the video 
signals are processed by "boxcar" circuits which 
introduce steps into the signals to hold their
levels approximately constant for the active per-
iod of the switch tubes.
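
The holding action of the "boxcar" circuits can be pictured as a sample-and-hold. The following is a minimal sketch on an idealized sampled signal; the function name `boxcar_hold` and the parameter `hold` (the number of samples per clock period) are illustrative assumptions and do not come from the circuit itself.

```python
def boxcar_hold(signal, hold):
    """Idealized boxcar: take one sample every `hold` samples and keep
    the output at that value until the next sample is taken."""
    out = []
    for start in range(0, len(signal), hold):
        held = signal[start]
        # Repeat the held value for the rest of this period.
        out.extend([held] * min(hold, len(signal) - start))
    return out


ramp = [0, 1, 2, 3, 4, 5, 6, 7]
stepped = boxcar_hold(ramp, 4)
```

A steadily rising input thus becomes a sequence of flat steps, so that the deflection voltage does not move during the interval in which the switch tube is pulsed on.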

The operation of the probability machine 
can now best be understood by reference to the 
block diagram, Figure 2. It is convenient to con- 
sider first the generation of signals by the fly- 
ing spot scanner and then the measuring and re- 
cording of the probability distribution by means 
of the switch tubes and associated circuits. 


Flying Spot Scanner 


The function of the scanner is to produce 
three separate video signals from three transpar- 
encies. Usually the three transparencies will be 
identical and the three signals will be in effect 
derived from three spatially related spots scan- 
ning one picture. To achieve this result, the 
blank raster of a high intensity flying spot cath- 
ode ray tube is imaged by means of a projection 
lens and two beamsplitters onto three glass trans- 
parencies. The light transmitted by each trans- 
parency is collected by condenser lenses and 
spread evenly over a portion of the photocathode 
of a photomultiplier tube. This is shown schema- 
tically in the block diagram. Figure 3 shows the 
entire optical system. The scanner tube is at 
the left. The projection lens, which is mounted 
on the end of the beamsplitter housing, is parti- 
ally visible thru the open access door. The first 
beamsplitter reflects one-third of the light to 
the right, the remainder continuing on to the 
second beamsplitter, which reflects 1/2 up and 
permits the rest to pass thru. The straight- 
through and right-hand beams illuminate transpar- 
encies in movable holders, while the holder in 
the vertical beam is fixed. Each transparency is 
followed by condenser lenses and a phototube. 

In a properly designed flying-spot scanner, 
the main source of noise is shot effect in the 
photocurrent. Since the noise voltage is propor- 
tional to the square root of the current, the 
signal-to-noise ratio is proportional to the 
square root of the current. Noise is thus re- 
duced solely by increasing the photocurrent, i.e., 
by increasing the light incident on the phototube 
and by using the most sensitive photocathode 
available. We have attempted to meet these goals 
by the following methods: 


1. The combination of P16 phosphor and S4 
photocathode is the most efficient known. 

2. The scanner tube operates at the high- 
est voltage and current ratings in current engi- 
neering practice. 

3. The objective lens is the widest aper- 
ture lens available of this focal length (selected 
for other reasons) having adequate definition. 

4. The semi-reflecting mirrors are inter- 
ference beamsplitters of very high efficiency. 

5. The condenser lenses are made of Pyrex 
glass having good transmission in the spectral 
region of interest and are coated. 

6. Since for a given effective f number, 
the brightness of an image is independent of its 
size, the total light energy incident on an image 
is proportional to its area. Hence the largest
practical transparencies, i.e., 3 1/4 x 4", with a
useful area of 2 1/4 x 3", have been used, result- 
ing in a signal more than 4 times larger than if 
Leica size transparencies had been employed. 
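
The square-root dependence of signal-to-noise ratio on photocurrent, stated before the list above, follows from the standard shot-noise formula. As a sketch only, let I be the mean photocurrent, e the electronic charge, and B the bandwidth (symbols introduced here for illustration, not used elsewhere in the paper), and neglect any additional noise contributed by the multiplier stages:

$$\overline{i_n^2} = 2 e I B, \qquad \frac{S}{N} = \frac{I}{\sqrt{2 e I B}} = \sqrt{\frac{I}{2 e B}}$$

On this reckoning, a fourfold increase in collected light gives about a twofold improvement in signal-to-noise ratio.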

Further precautions are necessary to ensure 
that the three images are of equal size and high 
quality. Each beamsplitter is mounted on a very 
thin, flat support, called a pellicle, which con- 
sists of an organic membrane stretched tightly 
over a carefully lapped frame, in the manner of 
one-shot color cameras. The frames themselves are 
rigidly mounted in an accurately machined box. 

Glass transparencies are used for dimen- 
sional stability. All in a set are exposed one 
after the other in an electronically timed high 
quality enlarger, and developed at the same time 
in a hand-agitated rack so designed that each of 
the plates undergoes the same development. 

In the optical system the two adjustable 
slide holders are movable in such a manner that no 
difficulty is encountered in registering the three 
pictures. A sensitive means of determining when 
registry is achieved is to superimpose the sig- 
nals, in pairs, on a picture monitor. Micrometer 
adjustments, which can be seen in Figure 3, allow 
2 of the images to be displaced horizontally or 
vertically with respect to the third by measured 
amounts to derive the three signals needed for the 
probability measurement. 

In addition to the purely optical method of 
deriving the two additional signals needed for the 
measurement, delay lines of 1, 2, and 3 Nyquist 
intervals (1/8 µsec) are used, so that there are a
total of six different signals available for com- 
parison. 


Amplifiers 


The photomultiplier outputs are amplified 
and equalized for phosphor persistence and limited 
to 4 Mc/sec bandwidth in the preamplifiers ("P" in
Figure 2). The signals are then passed to the 
distribution and delay circuit where the three 
signals for analysis are selected from the six 
signals available. The three deflection ampli- 
fiers raise the signal level to the value required 
for the switch tubes. The monitor amplifier per- 
mits the observation of these three signals, in 
pairs, on the face of the picture monitor, as an 
aid in setting up the equipment. There is also 
a circuit available for taking cell-to-cell dif- 
ference signals. It uses a shorted delay line of 
one Nyquist interval round trip length, the line 
being terminated at its sending end in its char-
acteristic impedance, so as to add to the input
signal the inverted, delayed, returning signal. 
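
In sampled terms, the shorted delay line forms the difference between each picture element and the one before it. The sketch below is an idealized discrete model; the name `cell_difference` and the convention that the signal is zero before its first sample are assumptions made for illustration.

```python
def cell_difference(x, delay=1):
    """Input plus the inverted, delayed copy: y[n] = x[n] - x[n - delay].
    Samples before the start of the signal are taken as zero."""
    return [x[n] - (x[n - delay] if n >= delay else 0) for n in range(len(x))]


levels = [10, 10, 12, 15, 15, 9]
diffs = cell_difference(levels)
```

Wherever the brightness is unchanged from one cell to the next, the output is zero; a change of either sign produces a positive or negative difference.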

Another circuit not shown rectifies the 
deflection amplifier outputs and feeds control 
signals back to the photomultiplier tube power 
supplies for the purpose of holding the signal 
level constant at the switch tubes, even though 
the scanner tube brightness, the phototube sensi- 
tivity, or the amplifier gain should change. This 
arrangement holds the amplitude constant to within 
several per cent with a ten-fold change in gain 
anywhere in the system. At the same time it elim- 
inates the need for closely regulated photomulti-
plier power supplies.


Coincidence Amplifier and Counter 


Each switch tube output is amplified and 
applied to a diode coincidence circuit and then 
amplified again sufficiently to operate the first 
stage of a 17-stage binary counter. The switch 
tube output is adjusted by controlling the beam 
current so that when the beam is centered on the 
edge of its aperture, giving an output signal one 
half that obtained when the beam is completely in 
the hole, the signal at the counter input is just 
enough to operate it. In this way, a count is re- 
corded whenever the center of the beam is within 
the aperture. 

Each counter stage except the first is 
capable of operating at least twice as fast as the 
maximum rate at which it is called upon to operate 
in use. The maximum rate for the first stage is 
8 Mc/sec, and it is capable of 12 Mc/sec opera- 
tion. Cathode-follower coupled triodes are used 
in the first two stages, followed by 6 stages of 
amplifier-isolated triodes, and finally 9 medium 
power dual triodes. 17 stages are necessary only 
for pictures which are almost entirely a single 
brightness level, in which case a count is re- 
corded for each picture element in the one-bright- 
ness area. 


Staircase Generators 


These three circuits generate the 64-step 
reference voltages which are applied to the switch 
tubes along with the video signals. Each one in- 
cludes a 6-stage binary counter. The last stage 
of the first staircase generator feeds the first 
stage of the second generator and the last stage 
of the second generator feeds the first stage of 
the third generator. As shown in Figure 4, the 
staircase voltages are produced by having each 
"high" stage in turn cause a pentode to draw a 
carefully fixed current through a precision volt- 
age divider. By use of very high plate resistance 
pentodes, optimum operating point, a large amount 
of degeneration, drastic derating of components, 
push-pull operation, .05% resistors, and separ- 
ately regulated plate and screen voltage supplies, 
a linearity and long-term stability of about 1 
part in 1000 is achieved. "Ultra linear" cathode
followers of a special design are used at the 
staircase generator outputs so that large capaci- 
tors may be driven, providing a low impedance bias 
for the switch tube clamping circuits. These 
cathode followers have a performance matching that 
of the staircase generators and provide low imped- 
ance outputs for both positive- and negative-going 
signals of large amplitudes. 
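
Because each generator's last stage feeds the first stage of the next, the three 6-stage counters behave as a single 18-stage counter advanced once per frame. A minimal sketch of the resulting step sequence follows; the function name `staircase_steps` and the zero-based numbering of frames and steps are assumptions for illustration (the text numbers steps from 1 to 64).

```python
def staircase_steps(frame):
    """Steps (i, j, k) of the three generators for a given frame index,
    treating the three 6-stage counters as one 18-bit ripple counter."""
    i = frame & 0x3F          # first generator: advances every frame
    j = (frame >> 6) & 0x3F   # second: advances once per 64 frames
    k = (frame >> 12) & 0x3F  # third: advances once per 4096 frames
    return i, j, k


first = staircase_steps(0)
after_one_cycle = staircase_steps(64)
last = staircase_steps(64 ** 3 - 1)
```

Every one of the 64³ combinations of reference levels is visited exactly once before the third generator returns to its starting step.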


Readout Circuit and Recording Amplifier 


The seventeen stage binary counter which 
records the number of occurrences, in a frame 
interval, of the particular combination of video 
brightnesses being measured, receives no counts 
during the vertical blanking period, since the 
switch tube is cut off. During this interval,
the readout circuit interrogates the counter,
stores the seventeen digits on 17 storage capaci- 
tors, and then resets the counter. During the 
ensuing frame, the stored bits are read out, six 
at a time, and passed to the recording amplifier 
for permanent storage on the six information 
tracks of the magnetic tape. The seventh track on 
the tape is used for timing, and is supplied, by
the readout circuit, with a continuous 90 cycle 
signal coincident with the information pulses, if 
they are present. 

Recording is by the non-return-to-zero 
method, in which the tape is saturated in the 
track, the direction of saturation being changed 
when it is desired to record a one, no change 
being made for a zero. This style of recording is 
reliable and easy to read, and produces easily 
visible marks when developed in a suspension of 
magnetic particles, as shown in Figure 6. (Com- 
mercially available as Ferroprint Magnetic Inking 
Solution) 
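
The recording rule just described (reverse the saturation for a one, leave it unchanged for a zero) can be sketched for a single track as follows. The function names and the representation of the two saturation directions as 0 and 1 are assumptions made for illustration.

```python
def nrz_record(bits, start=0):
    """Return the saturation direction (0 or 1) after each bit cell:
    the direction flips for a one and stays unchanged for a zero."""
    state, track = start, []
    for b in bits:
        if b:
            state ^= 1
        track.append(state)
    return track


def nrz_read(track, start=0):
    """Recover the bits: a change of direction reads as a one,
    no change reads as a zero."""
    bits, prev = [], start
    for s in track:
        bits.append(1 if s != prev else 0)
        prev = s
    return bits


data = [1, 0, 1, 1, 0, 0, 1]
track = nrz_record(data)
recovered = nrz_read(track)
```

Since only changes of direction carry information, reading does not depend on knowing the absolute polarity of the head or amplifier.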

The format on the tape is exactly that nec- 
essary for the tape to be used as the input to the 
IBM 704 computer. Our tape recorder moves contin- 
uously at .45 inches per second, thus recording 
200 bits per inch in each track. 
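
The figures quoted for the tape are mutually consistent, as the following arithmetic sketch suggests. It assumes 30 frames per second (the thirty-cycle pulses mentioned in connection with the cycling circuit) and one timing cycle per six-bit group; the order of the groups in `split_count` is likewise an assumption, not taken from the tape format.

```python
import math

COUNTER_BITS = 17
TRACKS = 6
FRAMES_PER_SEC = 30            # assumed from the thirty-cycle frame pulses
TAPE_SPEED_IN_PER_SEC = 0.45

# Seventeen bits read out six at a time need three groups per frame.
groups_per_frame = math.ceil(COUNTER_BITS / TRACKS)
groups_per_sec = groups_per_frame * FRAMES_PER_SEC
bits_per_inch = groups_per_sec / TAPE_SPEED_IN_PER_SEC


def split_count(n):
    """Split a 17-bit count into three 6-bit groups, most significant
    first (the group order on tape is an assumption)."""
    return [(n >> shift) & 0x3F for shift in (12, 6, 0)]


example_groups = split_count(4162)
```

Three groups per frame at 30 frames per second gives the 90 cycle timing signal, and 90 groups per second at 0.45 inch per second gives 200 bits per inch in each track.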


Clock Pulse Generator 


This circuit generates a train of 8 Mc/sec 
pulses, starting at each horizontal synch pulse, 
and continuing throughout each TV line. It con- 
sists of a ringing circuit with feedback to keep 
the output constant. The pulses are used for 
sampling at the switch tube and also in the "box- 
car" circuits previously described. 


Cycling Circuit 


Because data must be taken so fast and recorded
as they are taken, all functions in the apparatus
must be completely automatic. The cycling circuit controls
all the other circuits for the 2 1/2 hour data 
recording period, and turns off the appropriate 
signals at the end. 

Before starting, the switch tubes are 
clamped off at their control grids, so that no 
pulses are transmitted to the counters. Clock 
pulses, insufficient to turn the tubes on, are 
continuously applied to the switch tube cathodes, 
and the video signals are continuously applied to 
their deflection plates. When data recording is 
to start, the tape is set in motion, the staircase 
generators are manually reset, and the "start" 
button on the cycling circuit is pressed. The 
next following 30 cycle pulse, obtained by divi- 
sion from the 60 cycle vertical blanking signal, 
opens a gate, passing blanking pulses to the 
switch tube and thirty cycle pulses to the first 
staircase generator. When, 2 1/2 hours later, the 
third generator returns to its "low" position, the 
gate is closed and recording stops. During this 
period, the blanking signal turns the switch tube 
on for the peak of each clock pulse which occurs 
during every active horizontal line. In addition, 
various spaces and extra marks are provided on the
tape, as required by the computer.
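
The 2 1/2 hour recording period follows directly from the number of frames required. A short arithmetic sketch, assuming one frame is measured per thirty-cycle pulse (the variable names are illustrative):

```python
LEVELS = 64
FRAMES_PER_SEC = 30   # one measured frame per thirty-cycle pulse (assumed)

frames_second_order = LEVELS ** 2
frames_third_order = LEVELS ** 3
seconds_second_order = frames_second_order / FRAMES_PER_SEC
hours_third_order = frames_third_order / FRAMES_PER_SEC / 3600
```

On these assumptions a second order run lasts a little over two minutes, and a third order run about 2.4 hours, in agreement with the stated recording period.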

The thirty cycle pulse is also sent to the 
readout circuit, where it initiates the train of 
events described previously. 

Figure 5 is a general view of the appara- 
tus: The tape recorder is in the right hand 
rack, on slides, mounted beneath the recording 
amplifier. The next rack contains the three de- 
flection amplifiers, with meters to indicate out- 
put level. Immediately above is a monitor scope 
connected in parallel with the first switch tube,
for the visual observation of second order dis- 
tributions. The three lowest units in the next 
rack are the staircase generators, above which 
are the cycling and readout circuits. The pream- 
plifiers and picture monitor are on the shelf 
above the optical system, while the space below 
the optical system is occupied by power supplies. 


The Computation 


The data taken are actually the number of 
occurrences, in a frame, of the particular com- 
bination of three brightnesses corresponding to 
the three staircase generator settings during 
that frame interval. If these data are divided 
by the total of all occurrences of all combina- 
tions in a run, they are true probabilities. 

That is, if n(i,j,k) is the datum when the stair- 
case generators are on their ith, jth and kth 
step, respectively, and p(i,j,k) is the probabil- 
ity of this combination, then 


$$p(i,j,k) = \frac{n(i,j,k)}{\sum_{i=1}^{64}\sum_{j=1}^{64}\sum_{k=1}^{64} n(i,j,k)} = \frac{n(i,j,k)}{N} \qquad (1)$$


where we define N as the total of all the counts 
of a run. N is equal to the number of Nyquist 
intervals in a frame, or about 190,000. The third 
order joint entropy, i.e., the information gener-
ated by three picture elements, is 


$$H(x,y,z) = -\sum_{i=1}^{64}\sum_{j=1}^{64}\sum_{k=1}^{64} p(i,j,k)\,\log p(i,j,k) \qquad (2)$$


where H is in bits if the logarithms are binary. 
Omitting indices for clarity, 


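$$H(x,y,z) = -\sum\sum\sum \frac{n}{N}\log\frac{n}{N} = -\frac{1}{N}\sum\sum\sum \left(n\log n - n\log N\right) \qquad (3)$$

$$H(x,y,z) = \frac{1}{N}\left[N\log N - \sum\sum\sum n\log n\right] \qquad (4)$$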


which is a form particularly suited for computa- 
tion. The information on the tape is divided into 
records 2048 numbers in length, corresponding to 
32 cycles of the first staircase generator. Thus 
a second order distribution fills just two records 
and a third order distribution fills 128 records. 
The numbers are put into storage in the computer,
one record at a time. As the data enter storage,
the sum of n is accumulated over each 64 values 
and over the record. The sum of n log n is accum- 
ulated over the record, the value of n log n being 
found from a stored function table containing the 
first 1024 values. After the record is in stor- 
age, numbers larger than 1024 are detected by com- 
paring the sum of n over the record in storage, 
effectively using only the first ten digits of 
each datum, with the running sum. Most records do 
not contain values higher than 2¹⁰, in which case
the two figures match. If they do not match, then 
each value of n is tested to find those greater 
than 1024. Large values are computed using a 
scale-changing formula and interpolation, and the 
correct value of n and of n log n then effectively 
replace the incorrect values in storage, the sums 
being recomputed. 
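
The procedure of this paragraph, together with equation (4), can be sketched as follows. This is an illustration under stated assumptions and not the 704 program: the table is taken to cover n = 0 to 1023, direct evaluation of n log n stands in for the scale-changing formula and interpolation (whose details are not given), and all names are hypothetical.

```python
import math

TABLE_SIZE = 1024
# Stored function table of n*log2(n) for n = 0 .. 1023 (0 log 0 taken as 0).
NLOGN = [0.0] + [n * math.log2(n) for n in range(1, TABLE_SIZE)]


def n_log_n(n):
    """Table lookup for small n; direct evaluation stands in for the
    scale-changing formula and interpolation used for large values."""
    return NLOGN[n] if n < TABLE_SIZE else n * math.log2(n)


def has_large_values(record):
    """Compare the true sum with the sum formed from only the low ten
    bits of each datum; they differ exactly when some datum is 1024
    or more."""
    return sum(record) != sum(n & (TABLE_SIZE - 1) for n in record)


def entropy_from_counts(counts):
    """Equation (4): H = (N log N - sum of n log n) / N, in bits."""
    N = sum(counts)
    return (N * math.log2(N) - sum(n_log_n(n) for n in counts)) / N


flat = entropy_from_counts([1, 1, 1, 1])
two_large = entropy_from_counts([2000, 2000])
```

The single comparison of sums avoids testing every datum in the common case where no count in the record exceeds the range of the table.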

The large-number procedure is also used to 
compute the 64 values of X log X for each pair of 
records, the X's being the partial sums of n over 
64 values, i.e., 

$$X(j,k) = \sum_{i=1}^{64} n(i,j,k) \qquad (5)$$


The 64 X log X are summed and stored, the 
X's themselves being discarded. Finally, the 
large-number procedure is used to calculate the 
one value of Y log Y for the pair of records, Y 
being the sum of n over the two records, i.e., 


$$Y(k) = \sum_{j=1}^{64}\sum_{i=1}^{64} n(i,j,k) \qquad (6)$$


At the end of each pair of records, we 
store the sum of n log n, the sum of X log X, and 
the one value each of Y and of Y log Y. At the 
end of the file, we print out these results for 
each of the 64 pairs of records, and in addition, 
calculate and print out N, the sum of n over the 
file and the sums of n log n, X log X, and Y log 
Y over the file. The one value of N log N is 
hand-calculated for each file. 

It is evident that the X's constitute a 
second order marginal distribution and the Y's 
constitute a first order marginal distribution. 
It is useful, therefore, to calculate the corres- 
ponding first and second order entropies: 


$$H(z) = \frac{1}{N}\left[N\log N - \sum_{k=1}^{64} Y\log Y\right] \qquad (7)$$

$$H(y,z) = \frac{1}{N}\left[N\log N - \sum_{j=1}^{64}\sum_{k=1}^{64} X\log X\right] \qquad (8)$$


The most significant entropies are H_{y,z}(x),
which is the information generated by the single
picture element x when the values of two other
elements, y and z, are known, and H_z(y), which is
the information generated by one picture element
when the value of one adjacent element, z, is
known. By Shannon's formulae,


$$H_{xy}(z) = H(x,y,z) - H(x,y) \qquad (9)$$


$$H_{y}(z) = H(y,z) - H(y) \qquad (10)$$
and these are found by subtraction of the entro- 
pies previously calculated. Note that the indices 
may easily be shifted around provided the relative 
displacements of the original scanning apertures 
are preserved. 
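
The chain from counts to conditional entropies can be sketched as below. The sketch takes X(j,k) as the sum of n over i, and Y(k) as the sum of n over both i and j for fixed k, which is our reading of the text; the conditional entropies are formed with the indices arranged to suit the marginals actually computed, H(y,z) and H(z). All names and the toy data are illustrative.

```python
import math
from collections import defaultdict


def h_from_counts(counts, N):
    """(N log N - sum of c log c) / N in bits, skipping zero counts."""
    return (N * math.log2(N)
            - sum(c * math.log2(c) for c in counts if c > 0)) / N


def entropies(n):
    """n maps (i, j, k) -> count. Returns H(x,y,z), H(y,z), H(z)."""
    N = sum(n.values())
    X = defaultdict(int)   # X(j,k): sum of n over i
    Y = defaultdict(int)   # Y(k): sum of n over i and j
    for (i, j, k), c in n.items():
        X[(j, k)] += c
        Y[k] += c
    return (h_from_counts(n.values(), N),
            h_from_counts(X.values(), N),
            h_from_counts(Y.values(), N))


# Toy data with two levels per element: x always equals y,
# while y and z are independent and equally likely.
toy = {(0, 0, 0): 5, (0, 0, 1): 5, (1, 1, 0): 5, (1, 1, 1): 5}
H_xyz, H_yz, H_z = entropies(toy)
H_x_given_yz = H_xyz - H_yz   # information from x when y and z are known
H_y_given_z = H_yz - H_z      # information from y when z is known
```

In the toy case x is fully determined by y, so it contributes no information once y and z are known, while y contributes one full bit when only z is known.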


Measurements 
Second Order Measurements 


The second order distributions of a number 
of subjects of varying complexity have been mea-
sured. In each case, our procedure has been to
record three distributions for each subject, with
the two video signals in register and then dis-
placed first one and then two Nyquist intervals
(1/8 µsec). Results for two subjects are tabula-
ted. A was our most complicated and B our least
complicated picture. 


TABLE I

Subject   Displacement   H(x,y)        H(x)          H_x(y)
A         0              [illegible]   5.65          1.70
A         1              9.06          [illegible]   3.36
A         2              9.77          [illegible]   4.04
B         0              [illegible]   [illegible]   0.95
B         1              6.15          4.30          1.85
B         2              6.64          [illegible]   2.29
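
Each row of Table I should satisfy equation (10) in the form H_x(y) = H(x,y) - H(x). The sketch below checks this on the row for subject B at a displacement of one interval (6.15, 4.30, and the 1.85 quoted in the text), and then shows, purely as an inference and not as a printed value, what the identity would imply for a first order entropy that cannot be read reliably in this copy.

```python
def conditional(h_xy, h_x):
    """H_x(y) = H(x,y) - H(x), rounded to the two decimals tabulated."""
    return round(h_xy - h_x, 2)


# Subject B, displacement 1: joint and first order entropies as printed.
check_b1 = conditional(6.15, 4.30)

# Subject A, displacement 1: H(x) implied by the legible H(x,y) and H_x(y).
# This is an inference from the identity, not a value read from the table.
implied_h_x_a1 = round(9.06 - 3.36, 2)
```

The implied value is close to the 5.65 printed for subject A with the signals in register, as would be expected if H(x) does not depend on displacement.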


The maximum possible value for H(x,y) is 
12 bits per symbol, and this would occur only in 
a completely random picture of flat amplitude dis- 
tribution. The maximum value of H(x) is 6 bits 
per symbol and this would occur only in pictures 
of flat amplitude distribution. H_x(y) is mostly
a measure of the intersymbol correlation, the 
effect of the first order distribution having 
been largely discounted by the conditional (i.e., 
previous value known) nature of this parameter. 
The minimum value of the conditional entropy is 
zero, a situation which occurs only when each pic- 
ture element is exactly like the previous one. In 
terms of the scatter pattern which develops when 
the two signals are applied to the two sets of 
deflection plates of the switch tube and monitor
tube in parallel, we would obtain a thin diagonal
line, and when the pattern was quantized for mea-
surement by the staircase generators, only the 64
diagonal elements in the array would give non-zero 
counts. Now in practice, due to non-linearity, 
noise, voltage drifts, and the like, we actually 
record counts in somewhat more than 64 elements. 
If, for example, all the counts that should be 
counted in one square are instead spread over two 
squares, the measured conditional entropy is one 
bit per symbol. Now as this is quite a small 
displacement, amounting to a noise of something
like 1/64 the amplitude of the signal, it can 
easily be seen that when we calculate for any pic- 
ture a conditional entropy around one bit, what we 
obtain is not the real information content of the 
picture, but a measure of the difficulty of pro- 
ducing and processing video signals to that accu- 
racy, at the present state of the television art. 
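
The one-bit figure quoted above is easily verified numerically. The sketch below builds an ideal diagonal distribution and one in which the counts of each diagonal square are divided equally between two squares; the wrap-around at the top level is a simplification, and the names and counts are illustrative.

```python
import math


def conditional_entropy(joint):
    """H_x(y) = H(x,y) - H(x) in bits; joint maps (x, y) -> count."""
    N = sum(joint.values())
    h_xy = -sum(c / N * math.log2(c / N) for c in joint.values() if c)
    px = {}
    for (x, _), c in joint.items():
        px[x] = px.get(x, 0) + c
    h_x = -sum(c / N * math.log2(c / N) for c in px.values() if c)
    return h_xy - h_x


LEVELS = 64
# Ideal case: each element exactly like the previous one.
ideal = {(x, x): 100 for x in range(LEVELS)}
# Disturbed case: the same counts spread evenly over two squares.
spread = {}
for x in range(LEVELS):
    spread[(x, x)] = 50
    spread[(x, (x + 1) % LEVELS)] = 50

h_ideal = conditional_entropy(ideal)
h_spread = conditional_entropy(spread)
```

The first order entropy is the same in both cases (6 bits for a flat distribution); only the joint entropy rises, from 6 to 7 bits.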

In our previous work at Harvard, the mini-
mum measured entropy for a 32 level picture was
about two bits. As this was largely attributable 
to noise, we devised an approximate method of cor- 
recting the measured results to eliminate the 
effect of noise. In the present work, where we 
measure minimum entropies of about one bit with a 
64 level picture, we have decided against making 
any noise correction. We now prefer to look at 
the coding problem in terms of dealing with actual 
pictures, of high but realizable quality. It 
should be emphasized, therefore, that the figures 
in the table above and elsewhere in this paper, 
are actual experimental data, and are not cor-
rected in any way for the deviation of the signals
from ideal. 

Another point worth noting about the data 
in the table is that H(x) is the entropy of the 
first order marginal distribution. As such, it 
should not vary as the displacement between the 
signals is changed, and in fact it is usually con- 
stant to within several percent, although natur- 
ally it varies considerably from subject to sub- 
ject, depending on whether the pictures have a 
broadly distributed range of brightnesses or are 
mostly dark or light. 

The entropy most important for coding is 
the information generated by one picture element 
when the brightness of the adjacent element is 
known. These values are underlined in the table 
and vary from 1.85 to 3.36 bits per symbol. The 
average of all the subjects we have measured is 
2.62 bits per symbol. From these results, it 
would seem that the amount of bandwidth compres- 
sion obtainable using only adjacent cell correla- 
tion is quite low, and that other relations must 
be investigated if a really efficient system is 
to be found. To this end we have also measured 
the second order distributions of difference sig- 
nals which are obtained by subtracting from each 
picture element the brightness of the previous 
element. Although this linear operation does not 
affect the entropy of the picture if all the pic- 
ture statistics are used, it may conceivably 
affect the low orders of approximation to the 
total entropy which we are measuring.

A complication in this measurement is that 
our equipment handles just 64 brightness levels, 
but a difference signal has twice as many levels 
as the original, since differences may be both 
positive and negative. Therefore, in each case 
where we desired to measure a difference signal, 
we first measured the original distribution with 
the amplitude reduced to cover only 32 of the 64 
levels available. The resultant entropies are of 
course lower than those of the full amplitude sig- 
nal, because the distribution is less uniform. We 
have found that H(x) goes down something less than 
one bit, and H_x(y) a little less than that. Where
the conditional entropy is already very low, it 
decreases hardly at all. When differences are now 
taken on this half amplitude signal, a full ampli- 
tude signal again results, the first order distri- 
bution of which, as expected, is sharply peaked at 
mid-amplitude, representing no change from cell to 
cell. The first order entropy usually is about 
one bit less than that of the half amplitude sig- 
nal; the conditional entropy of the difference
picture may be only slightly lower than the condi-
tional entropy of the original, or it may be very 
much lower, and we have not measured enough pic- 
tures of this type to be able to give a final re- 
sult. A promising feature is that in many pic-
tures, differencing appears to reduce the first
order entropy to almost as low a value as the con- 
ditional entropy of standard pictures. If further 
measurements bear this out, it means that coding 
on a symbol-to-symbol basis can be made almost as 
efficient as second order coding, by means of dif-
ferencing, with a significant saving in equipment.
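
The effect of differencing on the first order entropy can be illustrated with a synthetic signal. The sketch below uses a slowly varying half-amplitude line (levels 0 to 31) invented for the purpose; it is not one of the measured pictures, and the reduction it shows is far larger than that reported above for real subjects.

```python
import math
from collections import Counter


def first_order_entropy(samples):
    """Entropy in bits of the empirical distribution of `samples`."""
    N = len(samples)
    return -sum(c / N * math.log2(c / N) for c in Counter(samples).values())


# Synthetic half-amplitude line: a slow ramp up and back down.
line = list(range(32)) + list(range(31, -1, -1))
# Differences lie between -31 and +31; adding 32 places "no change"
# at mid-amplitude of the 64 available levels.
diff = [32 + b - a for a, b in zip(line, line[1:])]

h_original = first_order_entropy(line)
h_difference = first_order_entropy(diff)
```

The original line uses all 32 levels equally and so has a first order entropy of 5 bits, whereas the differences cluster on the three levels around mid-amplitude.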


Third Order Measurements 


We have also measured a number of third 
order distributions. It will be recalled that our
computation provides for the derivation of the
second and first order marginal distributions and
the calculation of their respective entropies, as
well as the calculation of the third order entropy.
It is a valuable checking procedure, therefore, to
compare the second order results from third order 
measurements with the second order results from 
second order measurements. A typical result is 
tabulated below, where the subject is B, our least 
complicated picture, and the scanning apertures 
are horizontally disposed and separated by one 
Nyquist interval spaces. 


TABLE II

Measurement   H(x,y,z)   H_xy(z)   H(x,y)        H(x)          H_x(y)
2nd order     --         --        [illegible]   [illegible]   [illegible]
3rd order     7.80       1.49      6.31          4.39          [illegible]


The second order entropies in the two mea- 
surements are within about 3% of each other. Con- 
sidering that the runs in question were made at 
different times and that the equipment was moved 
to a new laboratory in the time between, this is 
reasonable precision. Another point worth noting 
is that H_{xy}(z), the information generated by one
picture element when the brightness of the two
preceding elements is known, is significantly
lower than when the brightness of but one preced-
ing element is known. This indicates that picture
generation is in fact a higher-than-second order
process. However, the amount by which the infor-
mation content is reduced by the extra condition,
though significant, is small. Unless future mea-
surements show a decidedly greater reduction, it
is a reasonable conclusion that the point of di- 
minishing returns has been reached in this direc- 
tion and that a coding system, based on third 
order statistics of horizontally disposed picture 
elements, would not produce sufficiently greater 
compression than a second order system to warrant 
its greatly increased complexity. This result is 
in accord with (but could not have been predicted 
from) Harrison's² experiments in which he reduced
the average power of transmission by linear pre- 
diction and found that slope prediction, where the 
two preceding elements were taken into account, 
was not substantially better than the simple pre- 
diction that each element should have the same 
brightness as the one preceding. 




Typical Distributions 


In addition to having our data computed, it 
is also possible to have it printed out. At first 
thought it might seem that the volume of data 
would be so large that this would be a very awk- 
ward procedure. However, most of the combinations 
of brightness studied never exist in the picture 
and so much of our data is zero. The print out 
procedure skips blocks of zeros, thus compressing 
both the time required for print out and the bulk 
of the paper. 

In Figure 7, we have plotted three (of the 
64) "cuts" through a second order distribution. 
They are each sharply peaked at j=k, where the two
brightnesses are equal. The curves are generally 
bell-shaped and it is characteristic that the 
width of all the cuts in a pattern is about the 
same. In Figure 8 we have hand-calculated the
two first order marginal distributions from a 
printed out second order distribution, to check 
whether they matched as they should. In fact, 
they match quite well except for what seems to be 
a slight amplitude non-linearity in one of the 
signals, which would not have much effect on the 
calculated entropy. Figure 9 shows a few cuts 
through a third order distribution. These have 
much the expected shape. The fact that the peak 
is not exactly at i=j=k is due to a slight differ- 
ence in amplitude among the three signals and 
again would have little effect on the entropy. A 
feature of these curves is the rapidity with which 
they decrease from maximum value for other combin- 
ations of brightnesses. It is also no accident 
that the peak of each curve for successive values 
of j nearly coincides with the height of the adja- 
cent curve at the next value of k, since this is 
merely an indication of the symmetry of the dis- 
tribution in j and k. 

Figure 10 consists of photographs of the 
second order distributions whose entropies are 
given in Table 1. They spread out with succes- 
sively larger displacements, as expected. The 
similarity of the two channels and the noise level 
may also be judged by observing these patterns.


Acknowledgement 


Thanks are due to D. H. Kelly, who inter- 
ested the company in this work, and to John R. 
Clark, Jr., Technicolor Vice President, for his 
support of the project. G. T. Inouye and C. O.
Carlson assisted in design problems and the latter 
has been in charge of operating the machine. M. 
Reinhard did all the mechanical design. E. H. 
Borchardt, R. S. Hiatt, W. Cahn and W. Gismot con- 
structed the electronic circuits. Frances House 
did computations, drew graphs, and typed the paper. 
A. E. Mann conceived the method of machine compu- 
tation and did most of the detailed programming. 
We have received excellent cooperation from IBM 
personnel at all levels, and in particular from 
B. O. Evans, who helped with our tape problems,
and Helene Steinman and Mrs. M. Levine, program- 
mers, who had the additional job of guiding our 
"foreign" tapes through the intricacies of the 
Electronic Data Processing Machines. 


References

1. P. Toulon, L'Onde Elect., Vol. 28, p. 412, 1948.

2. C. W. Harrison, Experiments with Linear Prediction in Television, Bell System Tech. J., Vol. 31, No. 4, p. 764, July, 1952.

3. E. R. Kretzmer, Statistics of Television Signals, Bell System Tech. J., Vol. 31, pp. 751-763, July, 1952.

4. W. Bernstein, R. L. Chase, and A. W. Schardt, Rev. Sci. Instr., Vol. 24, No. 6, p. [illegible], June, 1953.

5. W. F. Schreiber, Ph. D. Thesis, Harvard, 1953; IRE Convention Record, Part 4, p. 35, 1953.


Fig. 1 - Switch tube. (Labeled parts: electron gun, deflection plates, "Aquadag" connected to anode, "Aquadag" connected to cup, Kovar cup collector, glass envelope.)


Fig. 2 - Probability machine -- block diagram.


Fig. 4 - Staircase generator and video coupling arrangement. (Labeled blocks: staircase generator, three additional stages, "ultra linear" cathode follower, clamp circuit.)


Fig. 7 - Typical second order distribution. (Panel title: p(j,k) vs. k; vertical axis: - log p(j,k).)


Fig. 8 - Typical first order marginal distribution. (Curves: p(j) vs. j and p(k) vs. k; horizontal axis: j or k, marked up to 64.)


Fig. 9 - Typical third order distribution. (Panel title: p(i,j,k) vs. k; vertical axis: - log p(i,j,k).)


Fig. 10 - Second order distributions on the monitor scope. The simplest and most complicated subjects,
together with second order distributions for displacements of 0, 1 and 2 Nyquist intervals.


GAP ANALYSIS AND SYNTAX * 


Victor H. Yngve 
Department of Modern Languages and Research Laboratory of Electronics 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


Summary 


A statistical procedure has been tried as 
a method of investigating the structure of lan- 
guage with the aid of data processing machines. 
The frequency of gaps of various lengths between 
occurrences of two specified words is counted. 
The results are compared with what would be ex- 
pected if the occurrences of the two words were 
statistically independent. Deviations from the 
expected number give clues to the constraints that 
operate between words in a language. 


Introduction 


Language is a very complex communication 
code. One of the tasks of the linguist is to discover
the structure of languages, or the rules of
the codes, and to state them in a simple and con- 
cise way. To do this, he collects for data actual 
samples of a language, looks for regularities in 
the data by applying various procedures of analy- 
sis and an appropriate amount of intuition, forms 
hypotheses, and tests them on more data. When he 
is finished, he has what he calls a description, 
or a grammar of a language. 


Many of the difficulties that the linguist 
faces in his task of discovering and describing 
structure stem from the very complexity of the 
code and from the large amounts of data that must 
be examined. It has been suggested that modern 
data handling techniques using punched card ma- 
chines or electronic digital computers might be 
able to overcome difficulties that arise from the 
sheer bulk of data. The purpose of this paper is 
to discuss a way in which this might be done. 


Our procedure is a statistical one and 
makes use of the fact that order and disorder are 
in a sense complementary. Statistical independ- 
ence implies lack of structure, and any devia- 
tions from randomness can be taken as an indica- 
tion of structure. The procedure, therefore, has 
two parts: one part deals with the setting up of 
an appropriate statistical model of language; the 
other part deals with the deviations from randomness
exhibited by language and their interpretation
in terms of structure.


We assume that language can be represented
as a sequence of symbols. These might be letters,
phonemic characters, syllables, morphemes, or
other elements. It is convenient if there is an
operational procedure for segmenting the text
into separate symbols. Such a procedure exists
for words in conventional spelling; they are
separated in a text by spaces or punctuation.
For purposes of the example in this paper, we
have adopted English words as the symbols.


* This work was supported in part by the
Army (Signal Corps), the Air Force (Office of Scientific
Research, Air Research and Development
Command), and the Navy (Office of Naval Research);
and in part by the National Science Foundation.


The obvious first step in the statistical 
analysis of a text is to investigate the relative 
frequency of the different symbols. Counting the 
frequencies of words has been a favorite occupa- 
tion for over 60 years and a considerable amount 
of data exists¹. In 1928, E. U. Condon² observed
that when words are ranked in order of decreasing 
frequency, the product of the frequency and the 
rank is approximately constant. Several explana- 
tions have been offered for this or other formu- 
lations of the word-frequency distribution law. 
Of these, two are especially interesting. 


B. Mandelbrot³ assumes that words are separated
by spaces and are spelled with letters to
which a cost function is attached. A message 
composed of words with the observed frequency dis- 
tribution transmits the maximum amount of informa- 
tion in the sense of Shannon, compatible with a 
given average cost per word. 


H. A. Simon⁴ assumes a simple stochastic
model. The probability that the next word to 
appear will be one of the words that has already 
appeared n times is set proportional to the total 
number of occurrences of words that have each 
appeared n times. There is a constant probability 
that the next word will be a word that has not 
already occurred. The observed frequency distri- 
bution agrees with the one that will keep the 
fraction of the words that occur n times approximately
constant.
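The stochastic model just described can be sketched as a short simulation. This is only one illustrative reading of the verbal description, not Simon's own formulation: the function name `simulate_simon_text`, the parameter `alpha` (the constant probability of a new word), and the device of copying a uniformly chosen earlier token (which selects a word with probability proportional to the number of times it has already appeared) are all assumptions made here.

```python
import random
from collections import Counter

def simulate_simon_text(length, alpha, seed=0):
    """Generate a text of integer word ids under the assumed model.

    With probability alpha the next token is a word not seen before;
    otherwise it is a copy of a uniformly chosen earlier token, so a
    word is repeated with probability proportional to its count so far.
    """
    rng = random.Random(seed)
    text = [0]        # the first token is necessarily a new word
    next_id = 1
    while len(text) < length:
        if rng.random() < alpha:
            text.append(next_id)
            next_id += 1
        else:
            text.append(rng.choice(text))
    return text

text = simulate_simon_text(20000, alpha=0.1)
counts = Counter(text)
ranked = sorted(counts.values(), reverse=True)

# Print rank, frequency, and their product for a few ranks, to compare
# informally with the rank-frequency observation mentioned earlier.
for rank in (1, 2, 4, 8, 16):
    print(rank, ranked[rank - 1], rank * ranked[rank - 1])
```

The printed products give only a rough impression; how closely they approach a constant depends on `alpha` and on the length of the simulated text.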


The observed distributions are nearly the 
same for all languages. If the assumptions on 
which these explanations are based are valid for 
one language, they are valid for all languages. 
The Mandelbrot explanation involves an economy 
argument; the Simon explanation follows if users 
of a language try to maintain the word frequencies 
that they observe. 


After an investigation of the frequencies 
of individual symbols, the next step of a statis- 
tical analysis of a text is an examination of 
intersymbol constraints. These are of more direct 
concern to the linguist because they are different 
for different languages. 


Intersymbol constraints have been investigated
for various purposes. For cryptanalysis
purposes, frequency tables of two-letter and
three-letter sequences have been tabulated. For
the purposes of estimating the entropy of printed
English, Shannon⁵ has used various methods of
measuring the conditional probabilities that various
letters will follow certain sequences of
letters. The conditional probability concept has
been used⁶ as the basis for a model of a human
being regarded as a talking animal. A grammar is
conceived as an enormous array or matrix of the 
conditional probabilities that each morpheme in 
the language will be produced after a given se- 
quence of morphemes. A scheme of this sort fo- 
cuses attention on each position in a text and on 
the effect there of the immediately preceding one, 
two, three symbols, etc. 


The method of investigating intersymbol 
constraints reported here is also concerned with 
the conditional probability of finding a given 
word at a certain position in the text. But in- 
stead of specifying the immediately preceding one, 
two, or more, text positions and investigating
the effect on the probability of the words found 
there, we specify certain words or word combina- 
tions and investigate their effect as they are 
moved around in the vicinity of the given word. 
An advantage of this is that it allows more 
direct investigation of the effect of the occur- 
rence of a word on the probabilities some dis- 
tance away. It also allows easy and rather full 
investigation of the effects of the most frequent 
words first. Being most frequent, these words 
have an especially great influence on the gram- 
mar. 


The Procedure of Gap Analysis 


The statistical model that we use is a
model for a text divided into symbols (words). We
assume that the frequency f of each word W and
the total number of words N, or the length of the
text, are given as a result of direct measurement.
We assume that the probability of occurrence of
each word is equal to its relative frequency,
p(W) = f(W)/N, and is therefore independent of its
position in the text and of what words are nearby.


We look for deviations from the assumption 
that the probability of a word occurring is inde- 
pendent of the words in the neighborhood. To do 
this, we choose two different words and investi- 
gate their effect on each other's probability of 
occurrence. Or we can investigate the effect 
that a given word has on other occurrences of the 
same word. We define a gap of type A-B as the 
number of words intervening between an occurrence 
of A and a later occurrence of B. We can have 
gaps of type A-A between two occurrences of the 
same word. For each type of gap, we count the 
number of gaps of length 0, 1, 2, .... This can 
be done easily by machine by collecting one 
sample of text for each occurrence of A. All 
samples should be the same number of words in 
length and should have the occurrence of A at the 
same position. Then, for each of the other word 
positions, the number of occurrences of B in all 
the samples is counted. The results can be 
plotted as a histogram of the number of gaps
against the gap length. Gaps of type A-B can be
plotted on the right of the center of the histo- 
gram; gaps of type B-A on the left. 


Several features of the histogram pre- 
sentation of gap data should be noted. 


1. If the probability of occurrence of B 
is independent of its position with respect to A, 
we expect the distribution of gaps to be flat ex- 
cept for statistical fluctuations. 


2. The expected number of gaps of length 
n is then independent of n and can be calculated 
from the given frequencies: 


$$f(G_n) = p(B)\,f(A) = \frac{f(A)\,f(B)}{N}$$


3. We ignore the effect that the ends of 
the text have in reducing the possible number of 
gaps. Such effects will be small if the gap 
lengths investigated are appreciably smaller than 
the length of the entire text. 


4. A histogram with gaps of type A-B
plotted on the left and gaps of type B-A plotted 
on the right is a mirror image of a histogram 
with B-A on the left and A-B on the right. 


5. A histogram of gaps of type A-A is symmetrical
about the center position.


6. If a gap had been defined as the number 
of words occurring between an occurrence of A and 
the first occurrence of B, the assumption of sta- 
tistical independence would, of course, lead to 
an exponential distribution instead of a flat one, 
a fact that seems not to have been understood by 
various counters of gaps. We would have for the 
frequency of gaps of length n, 


$$f(G_n) = f(A)\,[1-p(B)]^{n}\,p(B) = \frac{f(A)\,f(B)}{N}\,e^{-kn}$$

where $k = -\ln[1-p(B)] > 0$.


For our purposes, gaps of the exponential type are 
not as convenient because they are harder to count 
by machine, require more calculating to obtain the 
expected number and the expected deviations, and 
because histograms with gaps of type A-B on the 
left and B-A on the right are not mirror images of 
those with B-A on the left and A-B on the right
on account of the different exponentials.
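The two expectations in items 2 and 6 can be compared numerically. The sketch below is illustrative only: the function names and the frequencies `F_A`, `F_B`, `N_WORDS` are hypothetical values chosen for round numbers, not data from the paper.

```python
import math

def expected_any_later(f_a, f_b, n_words):
    """Item 2: expected count at every gap length when all later
    occurrences of B are counted (flat distribution)."""
    return f_a * f_b / n_words

def expected_first_later(f_a, f_b, n_words, n):
    """Item 6: expected count at gap length n when only the first
    later occurrence of B is counted (geometric form)."""
    p_b = f_b / n_words
    return f_a * (1.0 - p_b) ** n * p_b

def expected_first_later_exp(f_a, f_b, n_words, n):
    """The same quantity written as an exponential, k = -ln[1 - p(B)]."""
    k = -math.log(1.0 - f_b / n_words)
    return (f_a * f_b / n_words) * math.exp(-k * n)

F_A, F_B, N_WORDS = 100, 50, 1000   # hypothetical frequencies
flat = expected_any_later(F_A, F_B, N_WORDS)
decaying = [expected_first_later(F_A, F_B, N_WORDS, n) for n in range(4)]
print(flat)
print(decaying)
```

Both definitions agree at gap length 0; the first-occurrence count then falls off by the factor 1 - p(B) per word, while the count used in this paper stays level.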


Trial Application to English Structure 


In order to be in a better position to 
assess the results in a first trial of the above 
procedure, we selected a small sample of a familiar
language - English. An article of about ten
thousand words from a popular magazine was chosen. 
Since this was a rather short article, only six 
of the most frequent words were investigated. The 
total number of words was counted as well as the 
frequency of each of these six words. These
numbers are tabulated below:


word                              frequency

the                                  599
to                                   252
of                                   241
a                                    221
and                                  207
in                                   162
                                    ----
(total for the six words)           1682

(number of words in article)        9490


It can be seen that these six words alone 
account for over 17.5 per cent of the occurrences 
of words in the text. Punctuation was ignored. 


Using these six words, all fifteen of the 
type A-B gaps were counted, and the six of type 
A-A. The results of the gap counting are pre- 
sented in Figs. 1 and 2. Along the abscissa of 
each histogram are plotted the various word positions.
For example, in Fig. 1-g, the word "the"
in all of the "the" samples of text is placed at 
the center position. The length of the bars of 
the histogram represents the number of times the 
word "a" appeared at the various text positions 
to the right or left of the "the". The numbers 
along the abscissa give the length of the gap or 
the number of words intervening between the oc- 
currence of the word "the" and the word "a". The 
six histograms of the gaps between two occurrences 
of the same word are shown in Figs. 1-a to 1-f.
In Fig. 1-f, the gaps were counted out to a length
of 31 words, and since the histogram would be sym- 
metrical anyway, only the right half is plotted. 


The expected height of the histogram bars, 
under the assumption of the statistical independ- 
ence of the two words, is given by the middle 
horizontal line. The upper and lower horizontal 
lines represent deviations amounting to plus or 
minus the square root of the height of the middle 
line. 
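The three horizontal lines can be computed directly from the tabulated frequencies. The following is a sketch under the independence assumption stated earlier; the names `FREQ`, `N_WORDS`, and `expected_band` are introduced here for illustration only.

```python
import math

# Frequencies and text length as tabulated in this section.
FREQ = {"the": 599, "to": 252, "of": 241, "a": 221, "and": 207, "in": 162}
N_WORDS = 9490

def expected_band(word_a, word_b):
    """Return (lower, middle, upper) line heights for an A-B histogram.

    The middle line is f(A) f(B) / N; the outer lines lie one square
    root of the middle height below and above it.
    """
    middle = FREQ[word_a] * FREQ[word_b] / N_WORDS
    spread = math.sqrt(middle)
    return middle - spread, middle, middle + spread

for pair in (("the", "the"), ("the", "a"), ("and", "and")):
    lower, middle, upper = expected_band(*pair)
    print(f"{pair[0]}-{pair[1]}: {lower:.1f}  {middle:.1f}  {upper:.1f}")

# Share of the text accounted for by the six words.
share = sum(FREQ.values()) / N_WORDS
```

For a rare pair such as "and"-"and" the band is wide relative to the middle line, which is one way of seeing why a longer text is desirable.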


Discussion of the Data 


It can be seen that, in general, the histo- 
grams show considerable deviation from what would 
be expected on the assumption of statistical in- 
dependence. These deviations can be attributed 
to syntactic structure. Since our aim is to de- 
velop techniques that can be used in discovering 
structure, it is of interest to see how the devi- 
ations from randomness shown by the data corre- 
late with what is known about the structure of 
English. 


If two words occur together in a structure,
that particular combination of two words will 
probably occur more frequently than expected on 
the assumption of statistical independence. The 
greater frequency of the particular combinations 
representing structures reduces the probability 
of occurrence of other combinations that do not 
represent structures. 




Figures 1-a to 1-f all show a depressed
region near the center of the histogram. This is 
taken to mean that these words tend not to recur 
immediately. The device of reduplication has 
only a limited use in English; this is probably 
true of many other languages. The length of the 
depressed region gives an idea of the length of 
the structures that frequently occur with these 
words. For example, structures with "the" can be 
expected to have two or three words. This is 
indeed true. But in the case of "and", the depressed
region extends over at least 15 gaps.
The total number of gaps of length 15 or less between
occurrences of "and" amounts to only 50 as
compared to an expected 68. This long depressed
region can be understood as attributable to the
fact that "and" not only correlates words, but
longer structures as well. One of the uses of
"and" is to coordinate clauses. Two occurrences
of "and" used for this purpose cannot be closer 
together than the length of a clause. 


Figure 1-a also shows that the word "the"
has a slight periodicity with a gap length of
2 to 6. Such a periodicity can result from
structures like the following:*


the tasks of the (2)
the linguist is to discover the (4)
the structure of languages or the (4)
the rules of the (2)
the difficulties that the (2)
the very complexity of the (3)


The average gap length between nearest occurrences 
of "the" is about 15 words. It is true that the 
depression at 0 and 1 must be compensated for by
an increase elsewhere, but in the absence of other 
constraints, this increase would not cause a peak, 
but would be spread evenly over all other gap 
lengths. There would be fewer positions available 
for "the", but they would all be equally probable. 


The gaps between different occurrences of 
the same word give an exactly symmetrical histo- 
gram, because it is always possible to inter- 
change the words without altering their roles. 
Whenever two different words give an approximately 
symmetrical histogram, it gives us a clue that it 
may be possible to interchange them without alter- 
ing their roles, i.e., they often play the same 
role and can be classed together. 
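One rough way to make "approximately symmetrical" operational is to compare the A-B and B-A counts at each gap length against the size of the expected statistical fluctuation. The sketch below is a heuristic added for illustration, not a test used in this paper: it assumes that each count fluctuates by about the square root of its expected value, so that the difference of two counts fluctuates by about the square root of twice that value. The function name, the threshold `n_sigma`, and the sample counts are all hypothetical.

```python
import math

def asymmetric_gaps(ab_counts, ba_counts, expected, n_sigma=2.0):
    """Return the gap lengths at which the A-B and B-A counts differ
    by more than n_sigma times sqrt(2 * expected)."""
    threshold = n_sigma * math.sqrt(2.0 * expected)
    return [gap
            for gap, (ab, ba) in enumerate(zip(ab_counts, ba_counts))
            if abs(ab - ba) > threshold]

# Hypothetical counts for gap lengths 0..3 with an expected height of 14.
ab = [2, 15, 13, 30]
ba = [3, 14, 12, 5]
print(asymmetric_gaps(ab, ba, expected=14.0))   # [3]
```

An empty result would be a clue, in the sense used above, that the two words might be classed together; a non-empty result points to the gap lengths that deserve a closer look.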


In Figs. 1-g to 1-l, we have collected all
the rest of the histograms that might be considered
symmetrical. There is little question
about the first four, but perhaps the last two do
deviate from symmetry by more than the statistical
fluctuations. Fig. 1-k shows possible deviations
at gap lengths of one and two; Fig. 1-l shows
possible deviations at a gap length of one. Let
us assume that the top four are symmetrical. On
this basis we tentatively group together and name:


"the" and "a" (the article group)
"of" and "to" and "in" (the preposition group)


keeping "and" separate from all the rest on the
assumption that Figs. 1-k and 1-l are not symmetrical.


* Examples are taken from the first paragraphs
of this paper.


All of the histograms of Fig. 2 are unsym- 
metrical. For two words to have an unsymmetrical 
gap histogram, they must frequently play differ- 
ent roles with respect to each other, and there- 
fore they should not be grouped together. 


Figures 2-a to 2-f are the six histograms 
that relate an article and a preposition. Our 
tentative grouping is given additional weight 
because these six histograms show certain simila- 
rities that can be attributed to the nature of 
articles and prepositions: The pattern "preposi- 
tion article" occurs, and often with high fre- 
quency, while there are no cases of "article pre- 
position". 


These six histograms also show differences 
between the two articles and between the three 
prepositions. "The" is different from "a" in 
that it is the preferred article after "of".
This shows up when Fig. 2-a is compared with Fig.
2-d. "Of" is different from "to", and from "in", 
which is probably a typical English preposition, 
in that "of" frequently follows an article with 
a gap of one or two. This is due to the very 
frequent "genitive" construction: 


the tasks of 
the structures of 
the rules of 
a grammar of


"To" is different from "of" and "in" in that it 
has a relatively low and broad peak before "the". 
The lowness of the peak is probably caused by the 
competition between the prepositional use of "to" 
and the use of "to" before an infinitive. The 
broadness is probably caused by an infinitive interposed
between "to" and "the":


to discover the 


Figures 2-g, 2-h, and 2-i show that "and" is 
different from "the", "a", and "in". If we take 
Fig. 1-k and Fig. 1-l as being unsymmetrical, it
is also differentiated from "of" and "to". 


The outlying peaks in Figs. 2-b and 2-e 
have not been explained. Perhaps they are sta- 
tistical fluctuations. The accuracy of the 
counting has been verified. 


Conclusions 


The first trial of the use of gap analysis 
for revealing certain aspects of the syntax of a 
language has been quite fruitful in exposing 
certain ways in which the technique can be improved,
and has been quite suggestive of its potentialities.
The technique is certainly not a
purely mechanical way of investigating the structure
of language. A considerable amount of insight
into language structure is required in
order to make best use of the gap histograms as a 
tool of analysis. 


The success of the procedure depends 
largely on the skill with which the text has been 
segmented into symbols. It is felt that this 
particular experiment would have been more mean- 
ingful if morphemes had been used instead of 
words as they are spelled. By using words, how- 
ever, we eliminated much preliminary work. If 
one wants to use morphemes, perhaps it would be 
appropriate to segment into phonemes and use 
statistical procedures directly on the phonemes. 
The frequent morphemes would soon appear as 
frequent patterns of phonemes. If one sticks to 
conventional spelling, it would probably be 
better to include punctuation on a par with words. 


One of the most serious limitations of our 
application of the procedure to English was the
shortness of the text that we chose. Conclusions 
could have been drawn with much greater certainty 
if statistical fluctuations had been smaller. In- 
stead of 10,000 words, perhaps 100,000 words 
should be the minimum length of text for gap ana- 
lysis. With a longer text, one could include 
many more of the frequent words because the word 
frequency distribution function begins to level 
off. Also with a longer text, one could take the 
next step of treating frequent constructions in 
the same manner as words and examining their 
effect on words in the vicinity. The constructions
"of the" and "the - of" were particularly frequent.
They are probably as frequent as the 10th
or 15th ranking word! 


A systematic order of procedure for another 
experiment would be to count the frequencies of 
the words; then to count the gaps, taking first 
those that have the highest product of the fre- 
quencies of the words involved; then to look for 
frequent constructions and collate them into the 
list of word frequencies so that they could be 
used along with the words for further gap counts. 
By comparing the behavior of certain words in the 
vicinity of a two-word construction with the be- 
havior of the words in the vicinity of each indi- 
vidual word involved in that construction, light 
can be shed on the multiple functions of words. 
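The ordering step of this proposed procedure, taking first the gaps with the highest product of word frequencies, can be sketched as follows. The function name `rank_pairs_by_product` is an assumption introduced here; the frequencies are the six tabulated earlier, and the later step of collating frequent constructions into the list is not attempted.

```python
from itertools import combinations_with_replacement

FREQ = {"the": 599, "to": 252, "of": 241, "a": 221, "and": 207, "in": 162}

def rank_pairs_by_product(freq):
    """List all unordered word pairs (including A-A pairs), highest
    product of frequencies first."""
    words = sorted(freq, key=freq.get, reverse=True)
    pairs = combinations_with_replacement(words, 2)
    return sorted(pairs, key=lambda p: freq[p[0]] * freq[p[1]], reverse=True)

ranked_pairs = rank_pairs_by_product(FREQ)
for a, b in ranked_pairs[:5]:
    print(f"{a}-{b}: {FREQ[a] * FREQ[b]}")
```

Since the expected bar height is this product divided by N, the pairs at the head of the list are those whose histograms suffer least from statistical fluctuation.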


There are a number of other things that can 
be done with gap histograms. Certain features 
that many histograms or groups of histograms have 
in common, such as the "prepositional peak", can be
used as a basis for grouping words together. Then 
the whole group of words can be counted as one to 
increase the number of cases, and rarer words can 
then be examined. On the other hand, the less 
frequent words can be grouped together on the 
basis of their behavior in the vicinity of fre- 
quent constructions. For example, one could group 
together all words that occur in the position between
"the" and "of". Then the statistical behavior
of the group could be investigated as if it
were a single word. There is a certain resemblance
here to the use of the substitution frame.
Perhaps gap analysis will be able to reveal the 
best substitution frames for use with the more 
standard methods of linguistics. 


Since the methods of gap analysis are far 
from being highly developed at this early stage, 
it is rather difficult to draw much of a comparison
with the more standard linguistic methods
involving informant techniques with native speakers.
It is particularly difficult, if not impossible, 
to get accurate quantitative information from an 
informant. For this reason, grammars have made 
no pretense of being quantitative, but have merely
reported what can occur and what cannot occur. 
Occasionally, linguists add the comment that some- 
thing is "rare" or "usual" or a "favorite" con- 
struction. Statements like this reveal that they 
think that a certain amount of quantitative in- 
formation is relevant, and that they would 
probably give more if they had the technique. 


Gap analysis provides a wealth of numerical 
information, perhaps more than is really relevant. 
The linguist who uses it may have to pick and 
choose. Some types of numerical information about 
a text are of little significance. For example, 
the word "police" might be frequent in a news- 
paper, but the word "circuit" might be frequent 
in an electrical engineering article. 


Because gap analysis is a numerical tech- 
nique, it focuses attention on frequencies and 
numerical results. There is the continual impli- 
cation that these numbers are worth something, 
that they are a relevant part of a grammar. To a
certain extent this is true. In general, sentence 
structure is carried by combinations of the fre- 
quent morphemes. Infrequent words cannot indicate 
by their form alone their role in the sentence 
unless they have included in them some frequent 
role-marking morpheme. For example, a nonsense
word like "sklack" could be a noun, a verb, an
adjective, or an adverb. But "the sklack" or 
"sklacked", would have definite roles indicated 
by the frequent "the" or "ed". A sentence, then, 
can be considered as a structure of frequent 
morphemes with various open positions in it where 
all the rest of the morphemes can be put, includ- 
ing the infrequent and the new words. The fre- 
quent morphemes and their combinations can be 
considered as role markers for the less frequent 
ones. One can conclude that the frequent morphemes
are the important ones for stating syntactic
patterns, and that a careful use of gap
analysis should be able to reveal these patterns. 


References 


1. Guiraud, P., "Bibliographie Critique de la
Statistique Linguistique," Spectrum, Utrecht-Anvers,
1954.


2. Condon, E. U., "Statistics of Vocabulary,"
Science, 67, 300 (1928).


3. Mandelbrot, B., "Simple Games of Strategy
Occurring in Communication through Natural
Languages," Trans. I.R.E., PGIT-3 (March
1954), p. 124.


4. Simon, H. A., "On a Class of Skew Distribution
Functions," Biometrika, 42 (Dec. 1955), pp.
425-440.


5. Shannon, C. E., "Prediction and Entropy of
Printed English," Bell Telephone System Monograph
1819 (1950).


6. Hockett, C. F., "A Manual of Phonology,"
Baltimore, Waverly Press, 1955 (Indiana University
Publications in Anthropology and
Linguistics, Memoir 11).


Fig. 1 - Gaps between occurrences of words. [Histograms not reproduced. Ordinate: number of occurrences; abscissa: gap lengths, 0 to 15 on either side of the center. Legible panel titles include THE-THE (a), OF-OF, A-A, AND-AND (f), THE-A, IN-TO / TO-IN, and TO-AND (l).]


Fig. 2 - Gaps between occurrences of words. [Histograms not reproduced. Ordinate: number of occurrences; abscissa: gap lengths, 0 to 15 on either side of the center. Legible panel titles include IN-AND and A-AND.]


THREE MODELS FOR THE DESCRIPTION OF LANGUAGE*


Noam Chomsky 
Department of Modern Languages and Research Laboratory of Electronics
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


Abstract 


We investigate several conceptions of 
linguistic structure to determine whether or 
not they can provide simple and "revealing" 
grammars that generate all of the sentences 
of English and only these. We find that no 
finite-state Markov process that produces 
symbols with transition from state to state 
can serve as an English grammar. Furthermore, 
the particular subclass of such processes that 
produce n-order statistical approximations to 
English do not come closer, with increasing n, 
to matching the output of an English grammar. 
We formalize the notions of "phrase structure"
and show that this gives us a method for 
describing language which is essentially more 
powerful, though still representable as a rather 
elementary type of finite-state process. Never- 
theless, it is successful only when limited to a 
small subset of simple sentences. We study the 
formal properties of a set of grammatical trans- 
formations that carry sentences with phrase 
structure into new sentences with derived phrase 
structure, showing that transformational grammars 
are processes of the same elementary type as 
phrase-structure grammars; that the grammar of 
English is materially simplified if phrase 
structure description is limited to a kernel of 
simple sentences from which all other sentences 
are constructed by repeated transformations; and 
that this view of linguistic structure gives a 
certain insight into the use and understanding
of language.


1. Introduction


There are two central problems in the 
descriptive study of language. One primary 
concern of the linguist is to discover simple 
and "revealing" grammars for natural languages. 
At the same time, by studying the properties of 
such successful grammars and clarifying the basic 
conceptions that underlie them, he hopes to
arrive at a general theory of linguistic 
structure. We shall examine certain features of 
these related inquiries. 


The grammar of a language can be viewed as 
a theory of the structure of this language. Any 
scientific theory is based on a certain finite 
set of observations and, by establishing general 
laws stated in terms of certain hypothetical 
constructs, it attempts to account for these
observations, to show how they are interrelated,
and to predict an indefinite number of new
phenomena. A mathematical theory has the
additional property that predictions follow
rigorously from the body of theory. Similarly,
a grammar is based on a finite number of observed
sentences (the linguist's corpus) and it
"projects" this set to an infinite set of
grammatical sentences by establishing general
"laws" (grammatical rules) framed in terms of
such hypothetical constructs as the particular
phonemes, words, phrases, and so on, of the
language under analysis. A properly formulated
grammar should determine unambiguously the set
of grammatical sentences.


*This work was supported in part by the Army
(Signal Corps), the Air Force (Office of Scientific
Research, Air Research and Development Command),
and the Navy (Office of Naval Research), and in
part by a grant from Eastman Kodak Company.


General linguistic theory can be viewed as 
a metatheory which is concerned with the problem 
of how to choose such a grammar in the case of 
each particular language on the basis of a finite 
corpus of sentences. In particular, it will 
consider and attempt to explicate the relation 
between the set of grammatical sentences and the 
set of observed sentences. In other words, 
linguistic theory attempts to explain the ability
of a speaker to produce and understand new
sentences, and to reject as ungrammatical other 
new sequences, on the basis of his limited 
linguistic experience. 


Suppose that for many languages there are 
certain clear cases of grammatical sentences and 
certain clear cases of ungrammatical sequences, 
e.g., (1) and (2), respectively, in English. 


(1) John ate a sandwich 
(2) Sandwich a ate John. 


In this case, we can test the adequacy of a 
proposed linguistic theory by determining, for 
each language, whether or not the clear cases 
are handled properly by the grammars constructed 
in accordance with this theory. For example, if 
a large corpus of English does not happen to 
contain either (1) or (2), we ask whether the 
grammar that is determined for this corpus will 
project the corpus to include (1) and exclude (2).
Even though such clear cases may provide only a 
weak test of adequacy for the grammar of a given 
language taken in isolation, they provide a very 
strong test for any general linguistic theory and 
for the set of grammars to which it leads, since 
we insist that in the case of each language the 
clear cases be handled properly in a fixed and 
predetermined manner. We can take certain steps 
towards the construction of an operational 
characterization of "grammatical sentence" that 
will provide us with the clear cases required to 
set the task of linguistics significantly.
Observe, for example, that (1) will be read by an
English speaker with the normal intonation of a 
sentence of the corpus, while (2) will be read 
with a falling intonation on each word, as will 
any sequence of unrelated words. Other dis- 
tinguishing criteria of the same sort can be 
described. 


Before we can hope to provide a satisfactory 
account of the general relation between observed 
sentences and grammatical sentences, we must 
learn a great deal more about the formal proper- 
ties of each of these sets. This paper is con- 
cerned with the formal structure of the set of 
grammatical sentences. We shall limit ourselves 
to English, and shall assume intuitive knowledge 
of English sentences and nonsentences. We then 
ask what sort of linguistic theory is required as 
a basis for an English grammar that will describe 
the set of English sentences in an interesting 
and satisfactory manner. 


The first step in the linguistic analysis of 
a language is to provide a finite system of
representation for its sentences. We shall assume
that this step has been carried out, and we 
shall deal with languages only in phonemic 
or alphabetic transcription. By a language, 
then, we shall mean a set (finite or infinite) 
of sentences, each of finite length, all 
constructed from a finite alphabet of symbols. 
If A is an alphabet, we shall say that anything 
formed by concatenating the symbols of A is a 
string in A. By a grammar of the language L we 
mean a device of some sort that produces all of 
the strings that are sentences of L and only 
these. 


No matter how we ultimately decide to 
construct linguistic theory, we shall surely 
require that the grammar of any language must be 
finite. It follows that only a countable set of 
grammars is made available by any linguistic
theory; hence that uncountably many languages,
in our general sense, are literally not describable 
in terms of the conception of linguistic structure 
provided by any particular theory. Given a 
proposed theory of linguistic structure, then, it 
is always appropriate to ask the following question: 


(3) Are there interesting languages that are
simply outside the range of description of the
proposed type?


In particular, we shall ask whether English is 
such a language. If it is, then the proposed 
conception of linguistic structure must be judged 
inadequate. If the answer to (3) is negative, we 
go on to ask such questions as the following: 


(4) Can we construct reasonably simple 
grammars for all interesting languages? 

(5) Are such grammars "revealing" in the 
sense that the syntactic structure that they 
exhibit can support semantic analysis, can provide 
insight into the use and understanding of language, 
etc.? 


We shall first examine various conceptions
of linguistic structure in terms of the possibility
and complexity of description (questions
(3), (4)). Then, in §6, we shall briefly
consider the same theories in terms of (5), and 
shall see that we are independently led to the 
same conclusions as to relative adequacy for the 
purposes of linguistics. 


2. Finite State Markov Processes.


2.1. The most elementary grammars which, with
a finite amount of apparatus, will generate an
infinite number of sentences, are those based on
a familiar conception of language as a
particularly simple type of information source,
namely, a finite-state Markov process. Specifically,
we define a finite-state grammar G
as a system with a finite number of states
$S_0, \ldots, S_q$, a set
$A = \{a_{ijk} \mid 0 \le i,j \le q;\ 1 \le k \le N_{ij} \text{ for each } i,j\}$
of transition symbols, and a set
$C = \{(S_i, S_j)\}$ of certain pairs of states of G that
are said to be connected. As the system moves
from state $S_i$ to $S_j$ it produces a symbol $a_{ijk} \in A$.
Suppose that

(6) $S_{\alpha_1}, \ldots, S_{\alpha_m}$

is a sequence of states of G with $\alpha_1 = \alpha_m = 0$, and
$(S_{\alpha_i}, S_{\alpha_{i+1}}) \in C$ for each $i < m$. As the system moves
from $S_{\alpha_i}$ to $S_{\alpha_{i+1}}$ it produces the symbol

(7) $a_{\alpha_i \alpha_{i+1} k}$

for some $k \le N_{\alpha_i \alpha_{i+1}}$. Using the arch ⌢ to signify
concatenation, we say that the sequence (6)
generates all sentences

(8) $a_{\alpha_1 \alpha_2 k_1}{}^\frown a_{\alpha_2 \alpha_3 k_2}{}^\frown \cdots {}^\frown a_{\alpha_{m-1} \alpha_m k_{m-1}}$

for all appropriate choices of $k_i$ (i.e., for
$k_i \le N_{\alpha_i \alpha_{i+1}}$). The language $L_G$ containing all and
only such sentences is called the language
generated by G.


Thus, to produce a sentence of $L_G$ we set the
system G in the initial state $S_0$ and we run
through a sequence of connected states, ending
again with $S_0$, and producing one of the associated
transition symbols of A with each transition from
one state to the next. We say that a language L
is a finite-state language if L is the set of
sentences generated by some finite-state grammar G.
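
The definition of §2.1 can be illustrated by a small sketch. The toy grammar below (states 0, 1, 2 and the words "the", "old", "man", "book", "comes") is a hypothetical example and is not taken from the text; the function name `generate` and the bound `max_transitions` are likewise assumptions introduced only so that the enumeration stops.

```python
def generate(transitions, max_transitions):
    """Enumerate sentences produced by state sequences that start and end in state 0.

    `transitions` maps a connected pair (i, j) to the list of symbols
    that may be produced on the move from state i to state j.
    Only state sequences with at most `max_transitions` moves are explored.
    """
    sentences = set()
    # Each frontier entry is (current_state, symbols_produced_so_far).
    frontier = [(0, ())]
    for _ in range(max_transitions):
        next_frontier = []
        for state, produced in frontier:
            for (i, j), symbols in transitions.items():
                if i != state:
                    continue
                for symbol in symbols:
                    path = produced + (symbol,)
                    if j == 0:
                        # The sequence has returned to the initial state: a sentence.
                        sentences.add(" ".join(path))
                    next_frontier.append((j, path))
        frontier = next_frontier
    return sentences


# Hypothetical toy grammar with a closed loop on state 1.
TOY = {
    (0, 1): ["the"],
    (1, 1): ["old"],
    (1, 2): ["man", "book"],
    (2, 0): ["comes"],
}

sentences = generate(TOY, 5)
```

Because of the loop on state 1, raising the bound yields ever longer sentences ("the old old ... man comes"), which is the sense in which a finite apparatus generates an infinite set.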


2.2. Suppose that we take the set A of transition
symbols to be the set of English phonemes. We
can attempt to construct a finite-state grammar G
which will generate every string of English
phonemes which is a grammatical sentence of
English, and only such strings. It is immediately
evident that the task of constructing a finite-state
grammar for English can be considerably
simplified if we take A as the set of English
morphemes³ or words, and construct G so that it
will generate exactly the grammatical strings of
these units. We can then complete the grammar
by giving a finite set of rules that give the
phonemic spelling of each word or morpheme in
each context in which it occurs. We shall
consider briefly the status of such rules in
§4.1 and §5.3.


Before inquiring directly into the problem 
of constructing a finite-state grammar for 
English morpheme or word sequences, let us 
investigate the absolute limits of the set of 
finite-state languages. Suppose that A is the 
alphabet of a language L, that $a_1, \ldots, a_n$ are
symbols of this alphabet, and that $S = a_1{}^\frown \cdots {}^\frown a_n$
is a sentence of L. We say that S has an (i,j)-dependency
with respect to L if and only if the
following conditions are met:


(9) (i) $1 \le i < j \le n$
    (ii) there are symbols $b_i, b_j \in A$ with the
         property that $S_1$ is not a sentence of
         L, and $S_2$ is a sentence of L, where $S_1$
         is formed from S by replacing the ith
         symbol of S (namely, $a_i$) by $b_i$, and $S_2$
         is formed from $S_1$ by replacing the
         jth symbol of $S_1$ (namely, $a_j$) by $b_j$.

In other words, S has an (i,j)-dependency with
respect to L if replacement of the ith symbol $a_i$
of S by $b_i$ ($b_i \ne a_i$) requires a corresponding
replacement of the jth symbol $a_j$ of S by $b_j$
($b_j \ne a_j$) for the resulting string to belong to L.


We say that $D = \{(\alpha_1, \beta_1), \ldots, (\alpha_m, \beta_m)\}$ is a
dependency set for S in L if and only if the
following conditions are met:

(10) (i) for $1 \le i \le m$, S has an $(\alpha_i, \beta_i)$-dependency
         with respect to L;
     (ii) for each i,j, $\alpha_i < \beta_j$;
     (iii) for each i,j such that $i \ne j$, $\alpha_i \ne \alpha_j$
           and $\beta_i \ne \beta_j$.

Thus, in a dependency set for S in L every two
dependencies are distinct in both terms and each
"determining" element in S precedes all "determined"
elements, where we picture $a_{\alpha_i}$ as
determining the choice of $a_{\beta_i}$.

Evidently, if S has an m-termed dependency set
in L, at least $2^m$ states are necessary in the
finite-state grammar that generates the
language L.
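
Conditions (9) and (10) can be checked mechanically when membership in L is decidable. The following is a minimal sketch under that assumption; the function names, the membership predicate `in_language`, and the three-word toy language are all hypothetical and serve only to exercise the definitions.

```python
def has_dependency(sentence, i, j, in_language, alphabet):
    """Check condition (9) for 1-based positions i < j of `sentence` (a tuple)."""
    n = len(sentence)
    if not (1 <= i < j <= n):          # condition (9i)
        return False
    for b_i in alphabet:
        s1 = sentence[:i - 1] + (b_i,) + sentence[i:]
        if in_language(s1):
            continue                    # (9ii) requires S1 to fall outside L
        for b_j in alphabet:
            s2 = s1[:j - 1] + (b_j,) + s1[j:]
            if in_language(s2):         # (9ii) requires S2 to be back in L
                return True
    return False


def is_dependency_set(sentence, pairs, in_language, alphabet):
    """Check conditions (10i)-(10iii) for a list of (alpha, beta) pairs."""
    alphas = [a for a, _ in pairs]
    betas = [b for _, b in pairs]
    return (
        all(has_dependency(sentence, a, b, in_language, alphabet) for a, b in pairs)
        and all(a < b for a in alphas for b in betas)      # (10ii)
        and len(set(alphas)) == len(alphas)                # (10iii)
        and len(set(betas)) == len(betas)
    )


# Hypothetical two-sentence language showing subject-verb agreement.
TOY_LANGUAGE = {("he", "is", "here"), ("they", "are", "here")}
TOY_ALPHABET = ["he", "they", "is", "are", "here"]


def in_toy(s):
    return s in TOY_LANGUAGE


sentence = ("he", "is", "here")
```

In this toy language, replacing "he" by "they" forces the replacement of "is" by "are", so the sentence has a (1,2)-dependency but no (1,3)-dependency.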




This observation enables us to state a 
necessary condition for finite-state languages. 


(11) Suppose that L is a finite-state 
language. Then there is an m such that no 
sentence S of L has a dependency set of more 
than m terms in L. 

With this condition in mind, we can easily
construct many nonfinite-state languages. For
example, the languages $L_1$, $L_2$, $L_3$ described in
(12) are not describable by any finite-state
grammar.

(12) (i) $L_1$ contains a⌢b, a⌢a⌢b⌢b,
         a⌢a⌢a⌢b⌢b⌢b, ..., and in
         general, all sentences consisting
         of n occurrences of a followed by
         exactly n occurrences of b, and only
         these;
     (ii) $L_2$ contains a⌢a, b⌢b, a⌢b⌢b⌢a,
          b⌢a⌢a⌢b, a⌢a⌢b⌢b⌢a⌢a, ...,
          and in general, all "mirror-image"
          sentences consisting of a string X
          followed by X in reverse, and only
          these;
     (iii) $L_3$ contains a⌢a, b⌢b, a⌢b⌢a⌢b,
           b⌢a⌢b⌢a, a⌢a⌢b⌢a⌢a⌢b, ...,
           and in general, all sentences consisting
           of a string X followed by
           the identical string X, and only
           these.

In $L_2$, for example, for any m we can find a
sentence with a dependency set
$D_m = \{(1,2m), (2,2m-1), \ldots, (m,m+1)\}$.⁴
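
The three languages of (12) are easy to recognize with a program that is free to compare the two halves of a string, which is exactly what a finite-state device cannot do for unbounded lengths. The sketch below is illustrative only: the function names are assumptions, sentences are written as Python strings over the letters a and b, and `mirror_pairs_are_dependencies` checks the pairs (i, 2m+1-i) of the mirror-image language by flipping one symbol at a time.

```python
def in_l1(s):
    """n occurrences of a followed by exactly n occurrences of b, n >= 1."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n


def in_l2(s):
    """A nonempty string X over {a, b} followed by X in reverse."""
    half = len(s) // 2
    return (len(s) >= 2 and len(s) % 2 == 0 and set(s) <= {"a", "b"}
            and s[half:] == s[:half][::-1])


def in_l3(s):
    """A nonempty string X over {a, b} followed by the identical string X."""
    half = len(s) // 2
    return (len(s) >= 2 and len(s) % 2 == 0 and set(s) <= {"a", "b"}
            and s[half:] == s[:half])


def flip(symbol):
    return "b" if symbol == "a" else "a"


def mirror_pairs_are_dependencies(sentence):
    """For a mirror-image sentence of length 2m, check each pair (i, 2m+1-i):
    changing symbol i alone leaves the language, and changing its mirror
    partner as well returns to it."""
    m = len(sentence) // 2
    for i in range(1, m + 1):
        j = 2 * m + 1 - i
        s1 = sentence[:i - 1] + flip(sentence[i - 1]) + sentence[i:]
        s2 = s1[:j - 1] + flip(s1[j - 1]) + s1[j:]
        if in_l2(s1) or not in_l2(s2):
            return False
    return True
```

Since m can be chosen as large as we please, no bound of the kind required by condition (11) exists for the mirror-image language.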
2.3. Turning now to English, we find that there
are infinite sets of sentences that have dependency
sets with more than any fixed number of terms.
For example, let $S_1, S_2, \ldots$ be declarative sentences.
Then the following are all English sentences:

(13) (i) If $S_1$, then $S_2$.
     (ii) Either $S_3$, or $S_4$.
     (iii) The man who said that $S_5$, is
           arriving today.

These sentences have dependencies between "if"-
"then", "either"-"or", "man"-"is". But we can
choose $S_1$, $S_3$, $S_5$, which appear between the
interdependent words, as (13i), (13ii), or (13iii)
themselves. Proceeding to construct sentences in this
way we arrive at subparts of English with just the
mirror-image properties of the languages $L_1$ and $L_2$
of (12). Consequently, English fails condition
(11). English is not a finite-state language, and
we are forced to reject the theory of language
under discussion as failing condition (3).
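
The embedding argument can be made concrete with a short sketch. It covers only the frames (13i) and (13ii); the filler clauses "it rains" and "it pours" and the function names are hypothetical choices made for illustration.

```python
# Each opening word determines the closing word that must follow later.
FRAMES = {"if": "then", "either": "or"}


def nest(openers, innermost="it rains", filler="it pours"):
    """Embed frames inside one another, as in choosing S1 or S3 of (13)
    to be another sentence of the form (13i) or (13ii)."""
    if not openers:
        return innermost
    first, rest = openers[0], openers[1:]
    return f"{first} {nest(rest, innermost, filler)}, {FRAMES[first]} {filler}"


def function_words(sentence):
    """Pick out the interdependent words, in order of occurrence."""
    marked = set(FRAMES) | set(FRAMES.values())
    return [w.strip(",") for w in sentence.split() if w.strip(",") in marked]


openers = ["if", "either", "if"]
sentence = nest(openers)
words = function_words(sentence)
```

The closing words appear in the reverse order of their opening words, so the sequence of interdependent words has the mirror-image pattern, and the depth of nesting is limited only by the length of `openers`.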

We might avoid this consequence by an 
arbitrary decree that there is a finite upper 
limit to sentence length in English. This would 
serve no useful purpose, however. The point is 
that there are processes of sentence formation 
that this elementary model for language is 
intrinsically incapable of handling. If no 
finite limit is set for the operation of these
processes, we can prove the literal inapplicability
of this model. If the processes have a
limit, then the construction of a finite-state
grammar will not be literally impossible (since
a list is a trivial finite-state grammar), but
this grammar will be so complex as to be of little
use or interest. Below, we shall study a model
for grammars that can handle mirror-image
languages. The extra power of such a model in the
infinite case is reflected in the fact that it is
much more useful and revealing if an upper limit
is set. In general, the assumption that languages
are infinite is made for the purpose of simplifying
the description. If a grammar has no
recursive steps (closed loops, in the model
discussed above) it will be prohibitively complex;
it will, in fact, turn out to be little better
than a list of strings or of morpheme class
sequences in the case of natural languages. If it
does have recursive devices, it will produce
infinitely many sentences.


2.4 Although we have found that no finite-state 
Markov process that produces sentences from left 
to right can serve as an English grammar, we 
might inquire into the possibility of constructing
a sequence of such devices that, in some nontrivial
way, come closer and closer to matching the output
of a satisfactory English grammar. Suppose, for
example, that for fixed n we construct a finite-state
grammar in the following manner: one state
of the grammar is associated with each sequence of
English words of length n and the probability that
the word X will be produced when the system is in
the state $S_i$ is equal to the conditional probability
of X, given the sequence of n words which
defines $S_i$. The output of such a grammar is
customarily called an (n+1)st order approximation to
English. Evidently, as n increases, the output of
such grammars will come to look more and more like
English, since longer and longer sequences have a
high probability of being taken directly from the
sample of English in which the probabilities were
determined. This fact has occasionally led to
the suggestion that a theory of linguistic
structure might be fashioned on such a model.
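
A minimal sketch of such an approximation follows. The eleven-word sample and the names `train` and `sample` are hypothetical; conditional probabilities are estimated simply by storing every observed continuation of each n-word state and drawing uniformly from that list.

```python
import random
from collections import defaultdict


def train(words, n):
    """Associate each sequence of n words (a state) with the words observed after it."""
    table = defaultdict(list)
    for k in range(len(words) - n):
        table[tuple(words[k:k + n])].append(words[k + n])
    return table


def sample(table, start, length, rng):
    """Produce up to `length` words, each drawn given the preceding n words."""
    out = list(start)
    n = len(start)
    while len(out) < length:
        followers = table.get(tuple(out[-n:]))
        if not followers:
            break  # this state was never continued in the sample
        out.append(rng.choice(followers))
    return out


# Hypothetical sample of English from which the probabilities are determined.
corpus = "the man took the book and the man read the book".split()
table = train(corpus, 1)
output = sample(table, ("the",), 8, random.Random(0))
```

Every adjacent pair in the output occurs somewhere in the sample, yet nothing in the procedure refers to grammaticalness; a string never observed is simply unobtainable, whether or not it is a sentence.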


Whatever the other interest of statistical 
approximation in this sense may be, it is clear 
that it can shed no light on the problems of 
grammar. There is no general relation between the 
frequency of a string (or its component parts) and 
its grammaticalness. We can see this most clearly 
by considering such strings as 

(14) colorless green ideas sleep furiously 
which is a grammatical sentence, even though it is 
fair to assume that no pair of its words may ever 
have occurred together in the past. Notice that a 
speaker of English will read (14) with the 
ordinary intonation pattern of an English sentence, 
while he will read the equally unfamiliar string 

(15) furiously sleep ideas green colorless 
with a falling intonation on each word, as in 
the case of any ungrammatical string. Thus (14) 
differs from (15) exactly as (1) differs from (2); 
our tentative operational criterion for gram- 
maticalness supports our intuitive feeling that 
(14) is a grammatical sentence and that (15) is 
not. We might state the problem of grammar, in 
part, as that of explaining and reconstructing 
the ability of an English speaker to recognize 
(1), (14), etc., as grammatical, while rejecting 
(2), (15), etc. But no order of approximation 
model can distinguish (14) from (15) (or an 
indefinite number of similar pairs). As n
increases, an nth order approximation to English
will exclude (as more and more improbable) an
ever-increasing number of grammatical sentences,
while it still contains vast numbers of completely
ungrammatical strings. We are forced to conclude
that there is apparently no significant approach
to the problems of grammar in this direction.


Notice that although for every n, a process
of n-order approximation can be represented as a
finite-state Markov process, the converse is not
true. For example, consider the three-state
process with $(S_0,S_1)$, $(S_1,S_1)$, $(S_1,S_0)$,
$(S_0,S_2)$, $(S_2,S_2)$, $(S_2,S_0)$ as its only connected
states, and with a, b, a, c, b, c as the respective
transition symbols. This process can be
represented by the following state diagram:

(16) [state diagram]

This process can produce the sentences a⌢a,
a⌢b⌢a, a⌢b⌢b⌢a, a⌢b⌢b⌢b⌢a, ..., c⌢c,
c⌢b⌢c, c⌢b⌢b⌢c, c⌢b⌢b⌢b⌢c, ..., but not
a⌢b⌢b⌢c, c⌢b⌢b⌢a, etc. The generated
language has sentences with dependencies of any
finite length.
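
The three-state process can be simulated directly. In the sketch below the table name `TRANSITIONS` and the function `produced` are assumptions; sentences are written as Python strings, and since each state has at most one move per symbol the simulation is deterministic.

```python
# (state, symbol) -> next state, for the six connected pairs listed above.
TRANSITIONS = {
    (0, "a"): 1, (1, "b"): 1, (1, "a"): 0,
    (0, "c"): 2, (2, "b"): 2, (2, "c"): 0,
}


def produced(sentence):
    """True if the process can produce `sentence`, starting and ending in state 0."""
    state = 0
    for symbol in sentence:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state == 0 and len(sentence) > 0
```

The final symbol is fixed by the first one no matter how many occurrences of b intervene, so no fixed number n of preceding symbols suffices to predict it.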


In §2.4 we argued that there is no 
significant correlation between order of approxi- 
mation and grammaticalness. If we order the 
strings of a given length in terms of order of 
approximation to English, we shall find both 
grammatical and ungrammatical strings scattered 
throughout the list, from top to bottom. Hence 
the notion of statistical approximation appears 
to be irrelevant to grammar. In §2.3 we pointed
out that a much broader class of processes,
namely, all finite-state Markov processes that
produce transition symbols, does not include an
English grammar. That is, if we construct a
finite-state grammar that produces only English
sentences, we know that it will fail to produce
an infinite number of these sentences; in particular,
it will fail to produce an infinite
number of true sentences, false sentences,
reasonable questions that could be intelligibly
asked, and the like. Below, we shall investigate
a still broader class of processes that might
provide us with an English grammar.


3. Phrase Structure.


3.1. Customarily, syntactic description is 
given in terms of what is called "immediate 
constituent analysis." In description of this 
sort the words of a sentence are grouped into 
phrases, these are grouped into smaller constituent
phrases and so on, until the ultimate
constituents (generally morphemes³) are reached.
These phrases are then classified as noun 
phrases (NP), verb phrases (VP), etc. For 
example, the sentence (17) might be analyzed as 
in the accompanying diagram. 


Evidently, description of sentences in such terms 
permits considerable simplification over the 
word-by-word model, since the composition of a 
complex class of expressions such as NP can be 
stated just once in the grammar, and this class 
can be used as a building block at various 
points in the construction of sentences. We now 
ask what form of grammar corresponds to this 
conception of linguistic structure. 


3.2. A phrase-structure grammar is defined by a
finite vocabulary (alphabet) $V_P$, a finite set Σ
of initial strings in $V_P$, and a finite set F of
rules of the form: X → Y, where X and Y are
strings in $V_P$. Each such rule is interpreted as
the instruction: rewrite X as Y. For reasons
that will appear directly, we require that in
each such [Σ,F] grammar

(18) Σ: $\Sigma_1, \ldots, \Sigma_n$
     F: $X_1 \to Y_1$
        ...
        $X_m \to Y_m$

$Y_i$ is formed from $X_i$ by the replacement of a
single symbol of $X_i$ by some string. Neither
the replaced symbol nor the replacing string
may be the identity element U of footnote 4.


Given the [Σ,F] grammar (18), we say that:

(19) (i) a string β follows from a string α
         if $\alpha = Z^\frown X_i{}^\frown W$ and $\beta = Z^\frown Y_i{}^\frown W$, for
         some $i \le m$;⁷
     (ii) a derivation of the string $S_t$ is a
          sequence $D = (S_1, \ldots, S_t)$ of strings,
          where $S_1 \in \Sigma$ and for each $i < t$, $S_{i+1}$
          follows from $S_i$;
     (iii) a string S is derivable from (18)
           if there is a derivation of S in
           terms of (18);
     (iv) a derivation of $S_t$ is terminated if
          there is no string that follows from
          $S_t$;
     (v) a string $S_t$ is a terminal string if
         it is the last line of a terminated
         derivation.


A derivation is thus roughly analogous to a
proof, with Σ taken as the axiom system and F
as the rules of inference. We say that L is a
derivable language if L is the set of strings
that are derivable from some [Σ,F] grammar,
and we say that L is a terminal language if it is
the set of terminal strings from some system
[Σ,F].


In every interesting case there will be a
terminal vocabulary $V_T$ ($V_T \subset V_P$) that
exactly characterizes the terminal strings, in
the sense that every terminal string is a string
in $V_T$ and no symbol of $V_T$ is rewritten in any of
the rules of F. In such a case we can interpret
the terminal strings as constituting the language
under analysis (with $V_T$ as its vocabulary), and
the derivations of these strings as providing
their phrase structure.


3.3. As a simple example of a system of the form
(18), consider the following small part of English
grammar:

(20) Σ: #⌢Sentence⌢#
     F: Sentence → NP⌢VP
        VP → Verb⌢NP
        NP → the⌢man, the⌢book
        Verb → took

Among the derivations from (20) we have, in
particular:

(21) $D_1$: #⌢Sentence⌢#
          #⌢NP⌢VP⌢#
          #⌢NP⌢Verb⌢NP⌢#
          #⌢the⌢man⌢Verb⌢NP⌢#
          #⌢the⌢man⌢Verb⌢the⌢book⌢#
          #⌢the⌢man⌢took⌢the⌢book⌢#

     $D_2$: #⌢Sentence⌢#
          #⌢NP⌢VP⌢#
          #⌢the⌢man⌢VP⌢#
          #⌢the⌢man⌢Verb⌢NP⌢#
          #⌢the⌢man⌢took⌢NP⌢#
          #⌢the⌢man⌢took⌢the⌢book⌢#


These derivations are evidently equivalent; they
differ only in the order in which the rules are
applied. We can represent this equivalence
graphically by constructing diagrams that
correspond, in an obvious way, to derivations.
Both $D_1$ and $D_2$ reduce to the diagram:

(22)        #⌢Sentence⌢#
               /      \
             NP        VP
            /  \      /  \
          the  man  Verb  NP
                     |   /  \
                   took the book

The diagram (22) gives the phrase structure of
the terminal sentence "the man took the book,"
just as in (17). In general, given a derivation
D of a string S, we say that a substring s of S
is an X if in the diagram corresponding to D, s
is traceable back to a single node, and this node
is labelled X. Thus given $D_1$ or $D_2$, corresponding
to (22), we say that "the⌢man" is an NP,
"took⌢the⌢book" is a VP, "the⌢book" is an
NP, "the⌢man⌢took⌢the⌢book" is a Sentence.
"man⌢took," however, is not a phrase of this
string at all, since it is not traceable back
to any node.
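
The notions "follows from" (19i) and "terminal string" (19v) can be sketched for grammar (20) as follows. Strings are represented as tuples of symbols; the names `SIGMA`, `RULES`, `successors` and `terminal_strings` are assumptions, and the exhaustive search is only safe because (20) yields finitely many derivable strings.

```python
SIGMA = [("#", "Sentence", "#")]
RULES = [
    (("Sentence",), ("NP", "VP")),
    (("VP",), ("Verb", "NP")),
    (("NP",), ("the", "man")),
    (("NP",), ("the", "book")),
    (("Verb",), ("took",)),
]


def successors(string, rules):
    """All strings that follow from `string` by one application of one rule."""
    found = set()
    for left, right in rules:
        k = len(left)
        for pos in range(len(string) - k + 1):
            if string[pos:pos + k] == left:
                found.add(string[:pos] + right + string[pos + k:])
    return found


def terminal_strings(sigma, rules):
    """Last lines of terminated derivations (assumes finitely many derivable strings)."""
    seen, stack, terminal = set(), list(sigma), set()
    while stack:
        string = stack.pop()
        if string in seen:
            continue
        seen.add(string)
        following = successors(string, rules)
        if not following:
            terminal.add(" ".join(string))  # nothing follows: the derivation is terminated
        stack.extend(following)
    return terminal


terminals = terminal_strings(SIGMA, RULES)
```

The four terminal strings correspond to the two choices for each NP; the different orders of rule application, as in the two derivations of (21), lead to the same terminal string.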


When we attempt to construct the simplest
possible [Σ,F] grammar for English we find that
certain sentences automatically receive nonequivalent
derivations. Along with (20), the
grammar of English will certainly have to contain
such rules as

(23) Verb → are⌢flying
     Verb → are
     NP → they
     NP → planes
     NP → flying⌢planes

in order to account for such sentences as "they
are flying - a plane" (NP-Verb-NP), "(flying)
planes - are - noisy" (NP-Verb-Adjective), etc.
But this set of rules provides us with two nonequivalent
derivations of the sentence "they are
flying planes", reducing to the diagrams:


(24)     #⌢Sentence⌢#                  #⌢Sentence⌢#
           /      \                      /      \
         NP        VP                  NP        VP
         |        /  \                 |        /  \
       they    Verb   NP             they    Verb   NP
                |      |                      |      |
               are  flying planes        are flying  planes


Hence this sentence will have two phrase 
structures assigned to it; it can be analyzed as 
"they - are - flying planes" or "they - are flying 
- planes." And in fact, this sentence is 
ambiguous in just this way; we can understand it 
as meaning that "those specks on the horizon -
are - flying planes" or "those pilots - are flying
- planes." When the simplest grammar automatically
provides nonequivalent derivations for some
sentence, we say that we have a case of
constructional homonymity, and we can suggest
this formal property as an explanation for the
semantic ambiguity of the sentence in question.
In §1 we posed the requirement that grammars
offer insight into the use and understanding of
language (cf. (5)). One way to test the adequacy
of a grammar is by determining whether or not
the cases of constructional homonymity are
actually cases of semantic ambiguity, as in (24).
We return to this important problem in §6.
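
Constructional homonymity can be detected by counting the distinct phrase-structure diagrams that a grammar assigns to a sentence. The sketch below does this for the rules of (20) and (23) that bear on the example; the dictionary layout and the function names are assumptions, and the boundary symbol # is omitted for brevity.

```python
GRAMMAR = {
    "Sentence": [("NP", "VP")],
    "VP": [("Verb", "NP")],
    "NP": [("they",), ("planes",), ("flying", "planes")],
    "Verb": [("are", "flying"), ("are",)],
}


def count_trees(symbol, words):
    """Number of distinct diagrams that trace `words` back to a node labelled `symbol`."""
    if symbol not in GRAMMAR:  # a word of the terminal vocabulary
        return 1 if words == (symbol,) else 0
    return sum(count_sequence(right, words) for right in GRAMMAR[symbol])


def count_sequence(symbols, words):
    """Ways of dividing `words` among `symbols`, each symbol covering at least one word."""
    if not symbols:
        return 1 if not words else 0
    head, rest = symbols[0], symbols[1:]
    total = 0
    for cut in range(1, len(words) - len(rest) + 1):
        left = count_trees(head, words[:cut])
        if left:
            total += left * count_sequence(rest, words[cut:])
    return total


ambiguous = count_trees("Sentence", ("they", "are", "flying", "planes"))
unambiguous = count_trees("Sentence", ("they", "are", "planes"))
```

A count greater than one signals nonequivalent derivations; here the two diagrams are exactly those of (24).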


In (20)-(24) the element # indicated 
sentence (later, word) boundary. It can be 
taken as an element of the terminal vocabulary
$V_T$ discussed in the final paragraph of §3.2.


3.4. These segments of English grammar are much 
oversimplified in several respects. For one 
thing, each rule of (20) and (23) has only a 
single symbol on the left, although we placed no 
such limitation on [Σ,F] grammars in §3.2.
A rule of the form

(25) Z⌢X⌢W → Z⌢Y⌢W

indicates that X can be rewritten as Y only in
the context Z--W. It can easily be shown that
the grammar will be much simplified if we permit
such rules. In §3.2 we required that in such
a rule as (25), X must be a single symbol. This
ensures that a phrase-structure diagram will be
constructible from any derivation. The grammar
can also be simplified very greatly if we order
the rules and require that they be applied in
sequence (beginning again with the first rule after
applying the final rule of the sequence), and if
we distinguish between obligatory rules which
must be applied when we reach them in the sequence
and optional rules which may or may not be
applied. These revisions do not modify the
generative power of the grammar, although they
lead to considerable simplification.


It seems reasonable to require for 
significance some guarantee that the grammar will 
actually generate a large number of sentences in 
a limited amount of time; more specifically, that 
it be impossible to run through the sequence of 
rules vacuously (applying no rule) unless the 
last line of the derivation under construction is 
a terminal string. We can meet this requirement 
by posing certain conditions on the occurrence of 
obligatory rules in the sequence of rules. We 
define a proper grammar as a system [Σ,Q],
where Σ is a set of initial strings and Q a
sequence of rules $X_i \to Y_i$ as in (18), with the
additional condition that for each i there must
be at least one j such that $X_i = X_j$ and $X_j \to Y_j$ is
an obligatory rule. Thus, each left-hand term of
the rules of (18) must appear in at least one
obligatory rule. This is the weakest simple
condition that guarantees that a nonterminated
derivation must advance at least one step every
time we run through the rules. It provides that
if $X_i$ can be rewritten as one of $Y_{i_1}, \ldots, Y_{i_k}$,
then at least one of these rewritings must take
place. However, proper grammars are essentially
different from [Σ,F] grammars. Let D(G) be
the set of derivations producible from a phrase
structure grammar G, whether proper or not. Let
$D_F = \{D(G) \mid G \text{ a } [\Sigma,F] \text{ grammar}\}$ and
$D_Q = \{D(G) \mid G \text{ a proper grammar}\}$. Then

(26) $D_F$ and $D_Q$ are incomparable; i.e.,
     $D_F \not\subset D_Q$ and $D_Q \not\subset D_F$.

That is, there are systems of phrase structure
that can be described by [Σ,F] grammars but not
by proper grammars, and others that can be
described by proper grammars but not by [Σ,F]
grammars.


3.5. We have defined three types of language: 
finite-state languages (in §2.1), derivable and 
terminal languages (in §3.2). These are related 
in the following way: 


(27) (i) every finite-state language is a
terminal language, but not conversely; 
(ii) every derivable language is a terminal 
language, but not conversely; 
(iii) there are derivable, nonfinite-state 
languages and finite-state, nonderivable 
languages. 


Suppose that $L_G$ is a finite-state language
with the finite-state grammar G as in §2.1.
We construct a [Σ,F] grammar in the following
manner: Σ = {$S_0$}; F contains a rule of the form
(28i) for each i,j,k such that $(S_i,S_j) \in C$, $j \ne 0$,
and $k \le N_{ij}$; F contains a rule of the form (28ii)
for each i,k such that $(S_i,S_0) \in C$ and $k \le N_{i0}$.

(28) (i) $S_i \to a_{ijk}{}^\frown S_j$
     (ii) $S_i \to a_{i0k}$

Clearly, the terminal language from this [Σ,F]
grammar will be exactly $L_G$, establishing the
first part of (27i).


In §2.2 we found that $L_1$, $L_2$, and $L_3$ of
(12) were not finite-state languages. $L_1$ and $L_2$,
however, are terminal languages. For $L_1$, e.g.,
we have the [Σ,F] grammar

(29) Σ: Z
     F: Z → a⌢b
        Z → a⌢Z⌢b

This establishes (27i).
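
The behavior of a grammar with the initial string Z and the two rules Z → a⌢b and Z → a⌢Z⌢b can be traced with a few lines of code. The function name and the length bound are assumptions introduced for illustration; each line of a derivation contains at most one occurrence of Z, so a single string suffices to represent the derivation in progress.

```python
def terminal_strings_up_to(max_length):
    """Terminal strings of at most `max_length` symbols, shortest first."""
    results = []
    line = "Z"  # the single initial string
    while len(line) + 1 <= max_length:
        # Applying Z -> a^b removes Z and so terminates the derivation.
        results.append(line.replace("Z", "ab"))
        # Applying Z -> a^Z^b keeps Z in the middle; the derivation continues.
        line = line.replace("Z", "aZb")
    return results


strings = terminal_strings_up_to(6)
```

Each pass through the recursive rule adds one a on the left and one b on the right, so the two counts cannot come apart; this is what the element Z, which is not itself a symbol of the language, contributes.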


Suppose that $L_4$ is a derivable language with
the vocabulary $V_P = \{a_1, \ldots, a_n\}$. Suppose that we
add to the grammar of $L_4$ a finite set of rules
$a_i \to b_i$, where the $b_i$'s are not in $V_P$ and are
all distinct. Then this new grammar gives a
terminal language which is simply a notational
variant of $L_4$. Thus every derivable language
is also terminal.

As an example of a terminal, nonderivable
language consider the language $L_5$ containing just
the strings

(30) a⌢b, c⌢a⌢b⌢d, c⌢c⌢a⌢b⌢d⌢d,
     c⌢c⌢c⌢a⌢b⌢d⌢d⌢d, ...

An infinite derivable language must contain an
infinite set of strings that can be arranged in a
sequence $S_1, S_2, \ldots$ in such a way that for some
rule X → Y, $S_i$ follows from $S_{i-1}$ by application
of this rule, for each $i > 1$. And Y in this rule
must be formed from X by replacement of a single
symbol of X by a string (cf. (18)). This is
evidently impossible in the case of $L_5$. This
language is, however, the terminal language given
by the following grammar:

(31) Σ: Z
     F: Z → a⌢b
        Z → c⌢Z⌢d

An example of a finite-state, nonderivable
language is the language $L_6$ containing all and
only the strings consisting of 2n or 3n
occurrences of a, for n = 1, 2, .... The language $L_1$
of (12) is a derivable, nonfinite-state language,
with the initial string a⌢b and the rule:
a⌢b → a⌢a⌢b⌢b.


The major import of Theorem (27) is that
description in terms of phrase structure is
essentially more powerful (not just simpler) than
description in terms of the finite-state grammars
that produce sentences from left to right. In
§2.3 we found that English is literally beyond
the bounds of these grammars because of mirror-image
properties that it shares with $L_1$ and $L_2$
of (12). We have just seen, however, that $L_1$
is a terminal language and the same is true of
$L_2$. Hence, the considerations that led us to
reject the finite-state model do not similarly
lead us to reject the more powerful phrase-structure
model.


Note that the latter is more abstract than 
the finite-state model in the sense that symbols 
that are not included in the vocabulary of a 
language enter into the description of this 
language. In the terms of §3.2, $V_P$ properly
includes $V_T$. Thus in the case of (29), we
describe $L_1$ in terms of an element Z which is not
in $L_1$; and in the case of (20)-(24), we introduce
such symbols as Sentence, NP, VP, etc., which are 
not words of English, into the description of 
English structure. 


3.6. We can interpret a [Σ,F] grammar of the
form (18) as a rather elementary finite-state
process in the following way. Consider a system
that has a finite number of states $S_0, \ldots, S_q$.
When in state $S_0$, it can produce any of the
strings of Σ, thereby moving into a new state.
Its state at any point is determined by the subset
of elements of $X_1, \ldots, X_m$ contained as substrings
in the last produced string, and it moves
to a new state by applying one of the rules to
this string, thus producing a new string. The
system returns to state $S_0$ with the production
of a terminal string. This system thus produces
derivations, in the sense of §3.2. The process
is determined at any point by its present state
and by the last string that has been produced,
and there is a finite upper bound on the amount
of inspection of this string that is necessary
before the process can continue, producing a new
string that differs in one of a finite number of
ways from its last output.


It is not difficult to construct languages
that are beyond the range of description of
[Σ,F] grammars. In fact, the language $L_3$ of
(12iii) is evidently not a terminal language. I
do not know whether English is actually a terminal 
language or whether there are other actual 
languages that are literally beyond the bounds of 
phrase structure description. Hence I see no way 
to disqualify this theory of linguistic structure 
on the basis of consideration (3). When we turn 
to the question of the complexity of description 
(cf. (4)), however, we find that there are ample 
grounds for the conclusion that this theory of 
linguistic structure is fundamentally inadequate. 
We shall now investigate a few of the problems 


that arise when we attempt to extend (20) to a 
full-scale grammar of English. 


4. Inadequacies of Phrase-Structure Grammar


4.1. In (20) we considered only one way of
developing the element Verb, namely, as "took".
But even with the verb stem fixed there are
a great many other forms that could appear in
the context "the man -- the book," e.g., "takes,"
"has taken," "has been taking," "is taking,"
"has been taken," "will be taking," and so on.
A direct description of this set of elements
would be fairly complex, because of the heavy
dependencies among them (e.g., "has taken" but
not "has taking," "is being taken" but not "is
being taking," etc.). We can, in fact, give a
very simple analysis of "Verb" as a sequence of
independent elements, but only by selecting as
elements certain discontinuous strings. For
example, in the phrase "has been taking" we can
separate out the discontinuous elements "has..en,"
"be..ing," and "take", and we can then say that
these elements combine freely. Following this
course systematically, we replace the last rule
in (20) by

(32) (i) Verb → Auxiliary⌢V
     (ii) V → take, eat, ...
     (iii) Auxiliary → C(M)(have⌢en)(be⌢ing)(be⌢en)
     (iv) M → will, can, shall, may, must
     (v) C → past, present

The notations in (32iii) are to be interpreted
as follows: in developing "Auxiliary"
in a derivation we must choose the unparenthesized
element C, and we may choose zero or more
of the parenthesized elements, in the order
given. Thus, in continuing the derivation
$D_1$ of (21) below line five, we might proceed
as follows:

(33) #⌢the⌢man⌢Verb⌢the⌢book⌢#
         [from $D_1$ of (21)]
     #⌢the⌢man⌢Auxiliary⌢V⌢the⌢book⌢#
         [(32i)]
     #⌢the⌢man⌢Auxiliary⌢take⌢the⌢book⌢#
         [(32ii)]
     #⌢the⌢man⌢C⌢have⌢en⌢be⌢ing⌢take⌢the⌢book⌢#
         [(32iii), choosing the elements C,
         have⌢en, and be⌢ing]
     #⌢the⌢man⌢past⌢have⌢en⌢be⌢ing⌢take⌢the⌢book⌢#
         [(32v)]

Suppose that we define the class Af as containing
the affixes "en", "ing", and the C's; and the
class v as including all V's, M's, "have", and "be."
We can then convert the last line of (33) into a
properly ordered sequence of morphemes by the
following rule:

(34) Af⌢v → v⌢Af⌢#


Applying this rule to each of the three Af⌢v
sequences in the last line of (33), we derive

(35) #⌢the⌢man⌢have⌢past⌢#⌢be⌢en⌢#⌢
     take⌢ing⌢#⌢the⌢book⌢#.


In the first paragraph of § 2.2 we mentioned 
that a grammar will contain a set of rules (called 
morphophonemic rules) which convert strings of 
morphemes into strings of phonemes. In the 
morphophonemics of English, we shall have such 
rules as the following (we use conventional, 
rather than phonemic orthography): 


(36) have⌢past → had
     be⌢en → been
     take⌢ing → taking
     will⌢past → would
     can⌢past → could
     M⌢present → M
     walk⌢past → walked
     take⌢past → took
     etc.


Applying the morphophonemic rules to (35) we 
derive the sentence; 
(37) the man had been taking the book. 
Similarly, with one major exception to be 
discussed below (and several minor ones that we 
shall overlook here), the rules (32), (34) will 
give all the other forms of the verb in 
declarative sentences, and only these forms. 
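
The passage from the last line of (33) to the sentence (37) can be sketched in code. The function names, the set names, and the particular fragment of the morphophonemic rules included in the table are assumptions; morphemes are represented as a list of Python strings, with "#" as the boundary symbol.

```python
AFFIXES = {"past", "present", "en", "ing"}  # the class Af
VERBALS = {"take", "eat", "have", "be", "will", "can", "shall", "may", "must"}  # the class v

# A fragment of the morphophonemic rules (36), keyed by adjacent morpheme pairs.
MORPHOPHONEMICS = {
    ("have", "past"): "had",
    ("be", "en"): "been",
    ("take", "ing"): "taking",
    ("will", "past"): "would",
    ("can", "past"): "could",
    ("take", "past"): "took",
}


def hop_affixes(morphemes):
    """Rule (34): rewrite each sequence Af v as v Af #, scanning from left to right."""
    out, k = [], 0
    while k < len(morphemes):
        if (k + 1 < len(morphemes) and morphemes[k] in AFFIXES
                and morphemes[k + 1] in VERBALS):
            out += [morphemes[k + 1], morphemes[k], "#"]
            k += 2
        else:
            out.append(morphemes[k])
            k += 1
    return out


def spell(morphemes):
    """Apply the table to adjacent pairs and drop the boundary symbols."""
    words, k = [], 0
    while k < len(morphemes):
        pair = tuple(morphemes[k:k + 2])
        if pair in MORPHOPHONEMICS:
            words.append(MORPHOPHONEMICS[pair])
            k += 2
        elif morphemes[k] == "#":
            k += 1
        else:
            words.append(morphemes[k])
            k += 1
    return " ".join(words)


last_line_of_33 = "# the man past have en be ing take the book #".split()
line_35 = hop_affixes(last_line_of_33)
sentence_37 = spell(line_35)
```

Note that `hop_affixes` must be told which morphemes belong to the class v; the membership of "take" in that class is information about the derivation, not about the string itself.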


This very simple analysis, however, goes
beyond the bounds of [Σ,F] grammars in several
respects. The rule (34), although it is quite
simple, cannot be incorporated within a [Σ,F]
grammar, which has no place for discontinuous
elements. Furthermore, to apply the rule (34)
to the last line of (33) we must know that "take"
is a V, hence, a v. In other words, in order to
apply this rule it is necessary to inspect more
than just the string to which the rule applies;
it is necessary to know some of the constituent
structure of this string, or equivalently
(cf. §3.3), to inspect certain earlier lines in
its derivation. Since (34) requires knowledge of
the "history of derivation" of a string, it violates
the elementary property of [Σ,F] grammars
discussed in §3.6.


4.2. The fact that this simple analysis of the
verb phrase as a sequence of independently chosen
units goes beyond the bounds of [Σ,F] grammars
suggests that such grammars are too limited to
give a true picture of linguistic structure.
Further study of the verb phrase lends additional
support to this conclusion. There is one major
limitation on the independence of the elements
introduced in (32). If we choose an intransitive
verb (e.g., "come," "occur," etc.) as V in (32),
we cannot select be⌢en as an auxiliary. We cannot
have such phrases as "John has been come,"
"John is occurred," and the like. Furthermore,
the element be⌢en cannot be chosen independently
of the context of the phrase "Verb." If we have
the element "Verb" in the context "the man -- the
food," we are constrained not to select be⌢en in
applying (32), although we are free to choose any
other element of (32). That is, we can have "the
man is eating the food," "the man would have been
eating the food," etc., but not "the man is eaten
the food," "the man would have been eaten the food,"
etc. On the other hand, if the context of the
phrase "Verb" is, e.g., "the food -- by the man,"
we are required to select be⌢en. We can have
"the food is eaten by the man," but not "the food
is eating by the man," etc. In short, we find that
the element be⌢en enters into a detailed network
of restrictions which distinguish it from all the
other elements introduced in the analysis of "Verb"
in (32). This complex and unique behavior of
be⌢en suggests that it would be desirable to
exclude it from (32) and to introduce passives into
the grammar in some other way.


There is, in fact, a very simple way to
incorporate sentences with be⌢en (i.e., passives)
into the grammar. Notice that for every active
sentence such as "the man ate the food" we have a
corresponding passive "the food was eaten by the
man" and conversely. Suppose then that we drop the
element be⌢en from (32iii), and then add to the
grammar the following rule:


(38) If S is a sentence of the form NP1 - Auxiliary - V - NP2,
then the corresponding string of the form
NP2 - Auxiliary⌢be⌢en - V - by⌢NP1 is also a sentence.


For example, if "the man - past - eat - the
food" (NP1 - Auxiliary - V - NP2) is a sentence, then
"the food - past⌢be⌢en - eat - by⌢the⌢man"
(NP2 - Auxiliary⌢be⌢en - V - by⌢NP1) is also a sentence.
Rules (34) and (36) would convert the first of
these into "the man ate the food" and the
second into "the food was eaten by the man."

The advantages of this analysis of passives
are unmistakable. Since the element be⌢en has
been dropped from (32) it is no longer necessary to
qualify (32) with the complex of restrictions
discussed above. The fact that be⌢en can occur
only with transitive verbs, that it is excluded in
the context "the man -- the food" and that it is
required in the context "the food -- by the man,"
is now, in each case, an automatic consequence of
the analysis we have just given.


A rule of the form (38), however, is well 
beyond the limits of phrase-structure grammars. 
Like (34), it rearranges the elements of the string 
to which it applies, and it requires considerable 
information about the constituent structure of this 
string. When we carry the detailed study of English 
syntax further, we find that there are many other 
cases in which the grammar can be simplified if the 
[Σ,F] system is supplemented by rules of the same
general form as (38). Let us call each such rule a
grammatical transformation. As our third model for
the description of linguistic structure, we now
consider briefly the formal properties of a
transformational grammar that can be adjoined to the
[Σ,F] grammar of phrase structure.


5. Transformational Grammar. 


5.1. Each grammatical transformation T will 
essentially be a rule that converts every sentence 
with a given constituent structure into a new 
sentence with derived constituent structure. The 
transform and its derived structure must be related 
in a fixed and constant way to the structure of 
the transformed string, for each T. We can 
characterize T by stating, in structural terms, 
the domain of strings to which it applies and the 
change that it effects on any such string. 

Let us suppose in the following discussion
that we have a [Σ,F] grammar with a vocabulary
$V_P$ and a terminal vocabulary $V_T \subset V_P$, as in § 3.2.


In § 3.3 we showed that a [Σ,F] grammar
permits the derivation of terminal strings, and we
pointed out that in general a given terminal string
will have several equivalent derivations. Two
derivations were said to be equivalent if they
reduce to the same diagram of the form (22), etc.
(cf. fn. 9). Suppose that $D_1, \ldots, D_n$ constitute
a maximal set of equivalent derivations of a terminal
string S. Then we define a phrase marker of S as the set
of strings that occur as lines in the derivations
$D_1, \ldots, D_n$. A string will have more than one phrase
marker if and only if it has nonequivalent
derivations (cf. (24)).


Suppose that K is a phrase marker of S. We
say that


(39) (S,K) is analyzable into $(X_1, \ldots, X_n)$ if
and only if there are strings $s_1, \ldots, s_n$ such that
(i) $S = s_1 \frown \cdots \frown s_n$
(ii) for each $i \le n$, K contains the string
$s_1 \frown \cdots \frown s_{i-1} \frown X_i \frown s_{i+1} \frown \cdots \frown s_n$


(40) In this case, $s_i$ is an $X_i$ in S with
respect to K (cf. fn. 10).


The relation defined in (40) is exactly the
relation "is a" defined in § 3.3; i.e., $s_i$ is an
$X_i$ in the sense of (40) if and only if $s_i$ is a
substring of S which is traceable back to a single
node of the diagram of the form (22), etc., and
this node is labelled $X_i$.
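Definition (39) can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: strings are tuples of symbols, each term of the analysis is a single symbol, only non-empty segments are tried, and the phrase marker `K` is a small hand-built fragment for "the man past eat the food" rather than a marker taken from the paper. The names `segmentations` and `analyze` are hypothetical.

```python
from itertools import combinations


def segmentations(tokens, n):
    """Yield every split of tokens into n non-empty contiguous segments."""
    for cuts in combinations(range(1, len(tokens)), n - 1):
        bounds = (0, *cuts, len(tokens))
        yield [tuple(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]


def analyze(S, K, X):
    """Return segments s_1..s_n witnessing (39) for the terms X, or None."""
    for segs in segmentations(S, len(X)):
        ok = True
        for i, label in enumerate(X):
            # Condition (ii): replace the i-th segment by its label X_i.
            line = sum(segs[:i], ()) + (label,) + sum(segs[i + 1:], ())
            if line not in K:
                ok = False
                break
        if ok:
            return segs
    return None


S = ("the", "man", "past", "eat", "the", "food")
K = {  # hand-built fragment of a phrase marker, for illustration only
    ("Sentence",),
    ("NP", "VP"),
    ("NP", "Verb", "NP"),
    ("NP", "past", "eat", "the", "food"),
    ("the", "man", "Verb", "the", "food"),
    ("the", "man", "Auxiliary", "eat", "the", "food"),
    ("the", "man", "past", "V", "the", "food"),
    ("the", "man", "past", "eat", "NP"),
    S,
}

found = analyze(S, K, ("NP", "Auxiliary", "V", "NP"))
print(found)
```

With this K the call recovers the subdivision the man - past - eat - the food, while a request such as `analyze(S, K, ("NP", "V", "NP"))` returns `None` because no line of K supports it.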

The notion of analyzability defined above 
allows us to specify precisely the domain of 
application of any transformation. We associate 
with each transformation a restricting class R 
defined as follows: 


(41) R is a restricting class if and only if
for some r, m, R is the set of sequences:

$X_1^1, \ldots, X_r^1$
$\vdots$
$X_1^m, \ldots, X_r^m$

where $X_i^j$ is a string in the vocabulary $V_P$, for each
i, j. We then say that a string S with the phrase
marker K belongs to the domain of the transformation
T if the restricting class R associated with T
contains a sequence $(X_1^j, \ldots, X_r^j)$ into which (S,K) is
analyzable. The domain of a transformation is
thus a set of ordered pairs (S,K) of a string S
and a phrase marker K of S. A transformation may
be applicable to S with one phrase marker, but not
with a second phrase marker, in the case of a string
S with ambiguous constituent structure.


In particular, the passive transformation
described in (38) has associated with it a
restricting class $R_p$ containing just one sequence:


(42) $R_p = \{(\mathrm{NP}, \mathrm{Auxiliary}, \mathrm{V}, \mathrm{NP})\}$.


This transformation can thus be applied to any
string that is analyzable into an NP followed by
an Auxiliary followed by a V followed by an NP.
For example, it can be applied to the string (43)
analyzed into substrings $s_1, \ldots, s_4$ in accordance
with the dashes.


(43) the man - past - eat - the food.


5.2. In this way, we can describe in structural
terms the set of strings (with phrase markers) to
which any transformation applies. We must now
specify the structural change that a transformation
effects on any string in its domain. An elementary
transformation t is defined by the following
property:


(44) for each pair of integers n, r ($n \le r$),
there is a unique sequence of integers $(a_0, a_1, \ldots, a_k)$
and a unique sequence of strings in $V_P$ $(Z_1, \ldots, Z_{k+1})$
such that
(i) $a_0 = 0$; $k \ge 0$; $1 \le a_j \le r$ for $1 \le j \le k$; $Y_0 = U$ (cf. fn. 11)
(ii) for each $Y_1, \ldots, Y_r$,
$t(Y_1, \ldots, Y_n; Y_n, \ldots, Y_r) = Y_{a_0} \frown Z_1 \frown Y_{a_1} \frown Z_2 \frown Y_{a_2} \frown \cdots \frown Y_{a_k} \frown Z_{k+1}$.

Thus t can be understood as converting the
occurrence of $Y_n$ in the context


(45) $Y_1 \frown \cdots \frown Y_{n-1} \frown \text{--} \frown Y_{n+1} \frown \cdots \frown Y_r$


into a certain string $Y_{a_0} \frown Z_1 \frown \cdots \frown Y_{a_k} \frown Z_{k+1}$,
which is unique, given the sequence of terms
$(Y_1, \ldots, Y_r)$ into which $Y_1 \frown \cdots \frown Y_r$ is subdivided.
t carries the string $Y_1 - \cdots - Y_r$ into a new string
$W_1 - \cdots - W_r$ which is related in a fixed way to
$Y_1 - \cdots - Y_r$. More precisely, we associate with t the
derived transformation t*:


(46) t* is the derived transformation of t if
and only if for all $Y_1, \ldots, Y_r$, $t^*(Y_1, \ldots, Y_r) = W_1 \frown \cdots \frown W_r$,
where $W_n = t(Y_1, \ldots, Y_n; Y_n, \ldots, Y_r)$ for each $n \le r$.


We now associate with each transformation T
an elementary transformation t. For example, with
the passive transformation (38) we associate the
elementary transformation $t_p$ defined as follows:


(47) $t_p(Y_1; Y_1, \ldots, Y_4) = Y_4$
$t_p(Y_1, Y_2; Y_2, Y_3, Y_4) = Y_2 \frown be \frown en$
$t_p(Y_1, Y_2, Y_3; Y_3, Y_4) = Y_3$
$t_p(Y_1, \ldots, Y_4; Y_4) = by \frown Y_1$
$t_p(Y_1, \ldots, Y_n; Y_n, \ldots, Y_r) = Y_n$ for all $n \le r \ne 4$.

The derived transformation $t_p^*$ thus has the
following effect:


(48) (i) $t_p^*(Y_1, \ldots, Y_4) = Y_4 - Y_2 \frown be \frown en - Y_3 - by \frown Y_1$
(ii) $t_p^*$(the⌢man, past, eat, the⌢food) =
the⌢food - past⌢be⌢en - eat - by⌢the⌢man.

The rules (34),(36) carry the right-hand side of
(48ii) into "the food was eaten by the man," just
as they carry (43) into the corresponding active
"the man ate the food."

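The relation between an elementary transformation and its derived transformation, as in (46)-(48), can be illustrated with a short sketch. It assumes that each term is a plain string and that concatenation inserts a space; the names `t_p` and `derived` are choices made for this example, and the treatment of sequences that do not have four terms follows the last clause of (47).

```python
def t_p(n, ys):
    """Elementary passive transformation (47): image of the n-th term (1-based)."""
    if len(ys) != 4:
        return ys[n - 1]            # identity whenever r is not 4
    y1, y2, y3, y4 = ys
    images = {
        1: y4,                      # first term is replaced by the fourth
        2: y2 + " be en",           # Auxiliary gains be^en
        3: y3,                      # V is unchanged
        4: "by " + y1,              # fourth term becomes by^Y1
    }
    return images[n]


def derived(t, ys):
    """Derived transformation (46): t*(Y_1..Y_r) = W_1 .. W_r, W_n = t(n, ys)."""
    return [t(n, ys) for n in range(1, len(ys) + 1)]


string_43 = ["the man", "past", "eat", "the food"]
result_48 = derived(t_p, string_43)
print(" - ".join(result_48))
# prints: the food - past be en - eat - by the man
```

Keeping `t_p` as a term-by-term function, rather than writing the rearrangement directly, mirrors the point that each W_n is fixed by the whole sequence of terms and the position n.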
The pair $(R_p, t_p)$ as in (42),(47) completely
characterizes the passive transformation as
described in (38). $R_p$ tells us to which strings
this transformation applies (given the phrase
markers of these strings) and how to subdivide
these strings in order to apply the transformation,
and $t_p$ tells us what structural change to effect
on the subdivided string.

A grammatical transformation is specified
completely by a restricting class R and an
elementary transformation t, each of which is
finitely characterizable, as in the case of the
passive. It is not difficult to define rigorously
the manner of this specification, along the lines
sketched above. To complete the development of
transformational grammar it is necessary to show
how a transformation automatically assigns a
derived phrase marker to each transform and to
generalize to transformations on sets of strings.
(These and related topics are treated in reference
[3].) A transformation will then carry a string S
with a phrase marker K (or a set of such pairs)
into a string S' with a derived phrase marker K'.


5.3. From these considerations we are led to a
picture of grammars as possessing a tripartite
structure. Corresponding to the phrase structure
analysis we have a sequence of rules of the form
X → Y, e.g., (20), (23), (32). Following this we
have a sequence of transformational rules such
as (34) and (38). Finally, we have a sequence of
morphophonemic rules such as (36), again of the
form X → Y. To generate a sentence from such a
grammar we construct an extended derivation
beginning with an initial string of the phrase
structure grammar, e.g., #⌢Sentence⌢#, as in
(20). We then run through the rules of phrase
structure, producing a terminal string. We then
apply certain transformations, giving a string of
morphemes in the correct order, perhaps quite a
different string from the original terminal string.
Application of the morphophonemic rules converts
this into a string of phonemes. We might run
through the phrase structure grammar several times
and then apply a generalized transformation to the
resulting set of terminal strings.


In § 3.4 we noted that it is advantageous to
order the rules of phrase structure into a
sequence, and to distinguish obligatory from
optional rules. The same is true of the
transformational part of the grammar. In § 4 we
discussed the transformation (34), which converts a
sequence affix-verb into the sequence verb-affix,
and the passive transformation (38). Notice that
(34) must be applied in every extended derivation,
or the result will not be a grammatical sentence.
Rule (34), then, is an obligatory transformation.
The passive transformation, however, may or may 
not be applied; either way we have a sentence. The 
passive is thus an optional transformation. This 
distinction between optional and obligatory trans- 
formations leads us to distinguish between two 
classes of sentences of the language. We have, on 
the one hand, a kernel of basic sentences that are 
derived from the terminal strings of the phrase- 
structure grammar by application of only 
obligatory transformations. We then have a set of 
derived sentences that are generated by applying 
optional transformations to the strings underlying 
kernel sentences. 
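The distinction between kernel and derived sentences can be sketched as a small pipeline: optional transformations first, then the obligatory rule (34), then morphophonemic rules. Everything below is an illustrative assumption rather than the paper's own formalization: the terminal string is given already subdivided as (NP, Auxiliary, V, NP), the word classes are tiny samples, and the spellings "ate", "was" and "eaten" are the ones implied by the examples "the man ate the food" and "the food was eaten by the man".

```python
AF = {"past", "en", "ing"}          # sample affixes
VERBALS = {"eat", "be", "have"}     # sample members of the class v
SPELLING = {("eat", "past"): "ate", ("be", "past"): "was", ("eat", "en"): "eaten"}


def passive(np1, aux, v, np2):
    """Optional transformation (38)."""
    return np2, aux + ["be", "en"], v, ["by"] + np1


def affix_hop(m):
    """Obligatory transformation (34): Af^v -> v^Af^#."""
    out, i = [], 0
    while i < len(m):
        if m[i] in AF and i + 1 < len(m) and m[i + 1] in VERBALS:
            out += [m[i + 1], m[i], "#"]
            i += 2
        else:
            out.append(m[i])
            i += 1
    return out


def spell_out(m):
    """Morphophonemic rules in the style of (36)."""
    words, i = [], 0
    while i < len(m):
        pair = tuple(m[i:i + 2])
        if pair in SPELLING:
            words.append(SPELLING[pair])
            i += 2
        elif m[i] == "#":
            i += 1
        else:
            words.append(m[i])
            i += 1
    return " ".join(words)


def generate(segments, optional=()):
    """Extended derivation: optional rules, then (34), then spelling."""
    for transformation in optional:
        segments = transformation(*segments)
    flat = [tok for seg in segments for tok in seg]
    return spell_out(affix_hop(flat))


terminal = (["the", "man"], ["past"], ["eat"], ["the", "food"])
kernel_sentence = generate(terminal)                      # obligatory rules only
derived_sentence = generate(terminal, optional=[passive])
print(kernel_sentence)    # prints: the man ate the food
print(derived_sentence)   # prints: the food was eaten by the man
```

Both outputs start from the same terminal string; only the choice of optional transformations differs.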

When we actually carry out a detailed study 
of English structure, we find that the grammar can 
be greatly simplified if we limit the kernel to a 
very small set of simple, active, declarative 
sentences (in fact, probably a finite set) such as 
"the man ate the food," etc. We then derive 
questions, passives, sentences with conjunction, 
sentences with compound noun phrases (e.g., 
"proving that theorem was difficult," with the NP 
"proving that theorem"), etc., by transformations. 
Since the result of a transformation is a sentence
with derived constituent structure, transformations
can be compounded, and we can form questions from
passives (e.g., "was the food eaten by the man"), 
etc. The actual sentences of real life are 
usually not kernel sentences, but rather 
complicated transforms of these. We find, however, 
that the transformations are, by and large, meaning 
preserving, so that we can view the kernel 
sentences underlying a given sentence as being, in 
some sense, the elementary "content elements" in 
terms of which the actual transform is “understood." 
We discuss this problem briefly in § 6, more 
extensively in references [1], [2]. 


In § 3.6 we pointed out that a grammar of
phrase structure is a rather elementary type of
finite-state process that is determined at each
point by its present state and a bounded amount of
its last output. We discovered in § 4 that this
limitation is too severe, and that the grammar can
be simplified by adding transformational rules
that take into account a certain amount of
constituent structure (i.e., a certain history of
derivation). However, each transformation is still
finitely characterizable (cf. §§ 5.1-2), and the
finite restricting class (41) associated with a
transformation indicates how much information
about a string is needed in order to apply this
transformation. The grammar can therefore still
be regarded as an elementary finite-state process
of the type corresponding to phrase structure.
There is still a bound, for each grammar, on how
much of the past output must be inspected in order
for the process of derivation to continue, even
though more than just the last output (the last
line of the derivation) must be known.


6. Explanatory Power of Linguistic Theories 


We have thus far considered the relative
adequacy of theories of linguistic structure only
in terms of such essentially formal criteria as
simplicity. In § 1 we suggested that there are
other relevant considerations of adequacy for
such theories. We can ask (cf. (5)) whether or
not the syntactic structure revealed by these
theories provides insight into the use and
understanding of language. We can barely touch on
this problem here, but even this brief discussion
will suggest that this criterion provides the
same order of relative adequacy for the three
models we have considered.


If the grammar of a language is to provide 
insight into the way the language is understood, 
it must be true, in particular, that if a sentence 
is ambiguous (understood in more than one way), 
then this sentence is provided with alternative 
analyses by the grammar. In other words, if a 
certain sentence S is ambiguous, we can test the 
adequacy of a given linguistic theory by asking 
whether or not the simplest grammar constructible 
in terms of this theory for the language in 
question automatically provides distinct ways of 
generating the sentence S. It is instructive to 
compare the Markov process, phrase-structure, and 
transformational models in the light of this test. 


In § 3.3 we pointed out that the simplest
[Σ,F] grammar for English happens to provide
nonequivalent derivations for the sentence "they 
are flying planes," which is, in fact, ambiguous. 
This reasoning does not appear to carry over for 
finite-state grammars, however. That is, there 
is no obvious motivation for assigning two 
different paths to this ambiguous sentence in any 
finite-state grammar that might be proposed for 
a part of English. Such examples of construction- 
al homonymity (there are many others) constitute 
independent evidence for the superiority of the 
phrase-structure model over finite-state grammars. 


Further investigation of English brings to 
light examples that are not easily explained in 
terms of phrase structure. Consider the phrase 


(49) the shooting of the hunters. 


We can understand this phrase with "hunters" as 
the subject, analogously to (50), or as the 
object, analogously to (51). 


(50) the growling of lions 
(51) the raising of flowers. 


Phrases (50) and (51), however, are not similarly
ambiguous. Yet in terms of phrase structure, each
of these phrases is represented as: the - V⌢ing -
of⌢NP.


Careful analysis of English shows that we can
simplify the grammar if we strike the phrases
(49)-(51) out of the kernel and reintroduce them
transformationally by a transformation T1 that
carries such sentences as "lions growl" into (50),
and a transformation T2 that carries such sentences
as "they raise flowers" into (51). T1 and T2 will
be similar to the nominalizing transformation
described in fn. 12, when they are correctly
constructed. But both "hunters shoot" and "they
shoot the hunters" are kernel sentences; and
application of T1 to the former and T2 to the
latter yields the result (49). Hence (49) has
two distinct transformational origins. It is a
case of constructional homonymity on the
transformational level. The ambiguity of the
grammatical relation in (49) is a consequence of the fact
that the relation of "shoot" to "hunters"
differs in the two underlying kernel sentences.
We do not have this ambiguity in the case of (50),
(51), since neither "they growl lions" nor
"flowers raise" is a grammatical kernel sentence.
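The argument about (49)-(51) amounts to counting the transformational origins of each phrase. The sketch below is a toy illustration only: the functions `t1` and `t2` are crude stand-ins for the transformations T1 and T2 (which the paper does not spell out here), the two kernel lists are hypothetical samples, the subject NP is written "the hunters" so that both routes give exactly the wording of (49), and the helper `ing` is a simplified spelling rule.

```python
from collections import defaultdict


def ing(verb):
    """Simplified spelling of verb + ing (drops a final 'e')."""
    return (verb[:-1] if verb.endswith("e") else verb) + "ing"


def t1(subject, verb):
    """Stand-in for T1: 'NP V' -> 'the V+ing of NP' (NP was the subject)."""
    return f"the {ing(verb)} of {subject}"


def t2(subject, verb, obj):
    """Stand-in for T2: 'they V NP' -> 'the V+ing of NP' (NP was the object)."""
    return f"the {ing(verb)} of {obj}"


# Hypothetical sample of kernel sentences, already subdivided.
KERNEL_INTRANSITIVE = [("lions", "growl"), ("the hunters", "shoot")]
KERNEL_TRANSITIVE = [("they", "raise", "flowers"), ("they", "shoot", "the hunters")]

origins = defaultdict(list)
for kernel in KERNEL_INTRANSITIVE:
    origins[t1(*kernel)].append(("T1", " ".join(kernel)))
for kernel in KERNEL_TRANSITIVE:
    origins[t2(*kernel)].append(("T2", " ".join(kernel)))

for phrase, sources in origins.items():
    print(phrase, "<-", sources)
```

Only "the shooting of the hunters" ends up with two entries, one from each transformation, which is the constructional homonymity described above.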


There are many other examples of the same 
general kind (cf. [1],[2]), and to my mind, they 
provide quite convincing evidence not only for 
the greater adequacy of the transformational 
conception of linguistic structure, but also for 
the view expressed in § 5.4 that transformational 
analysis enables us to reduce partially the 
problem of explaining how we understand a sentence
to that of explaining how we understand a kernel 
sentence. 


In summary, then, we picture a language as 
having a small, possibly finite kernel of basic 
sentences with phrase structure in the sense of 
§3, along with a set of transformations which 
can be applied to kernel sentences or to earlier 
transforms to produce new and more complicated 
sentences from elementary components. We have 
seen certain indications that this approach may 
enable us to reduce the immense complexity of 
actual language to manageable proportions and, in 
addition, that it may provide considerable insight 
into the actual use and understanding of language. 
Footnotes 


1. cf. [7]. Finite-state grammars can be 
represented graphically by state diagrams, as 
in [7], p.15f. 


2. See [6], Appendix 2, for an axiomatization of 
concatenation algebras. 


3. By 'morphemes' we refer to the smallest
grammatically functioning elements of the
language, e.g., "boy", "run", "ing" in
"running", "s" in "books", etc.


4, In the case of L,, >, of (911i) can be taken 


as an identity element U which has the
property that for all X, U⌢X = X⌢U = X. Then
$D_m$ will also be a dependency set for a
sentence of length 2m in $L_1$.


5. Note that a grammar must reflect and explain 
the ability of a speaker to produce and understand
new sentences which may be much longer
than any he has previously heard. 

6. Thus we can always find sequences of n+1
words whose first n words and last n words 
may occur, but not in the same sentence (e.g. 


replace "is" by "are" in (13ii11), and choose 
S. of any required length). 


7. Z or W may be the identity element U (cf. fn. 4)
in this case. Note that since we limited (18)
so as to exclude U from figuring significantly
on either the right- or the left-hand side
of a rule of F, and since we required that
only a single symbol of the left-hand side
may be replaced in any rule, it follows that
$Y_i$ must be at least as long as $X_i$. Thus we
have a simple decision procedure for derivability
and terminality in the sense of
(19iii), (19v).


8. See [3] for a detailed development of an 
algebra of transformations for linguistic 
description and an account of transforma- 
tional grammar. For further application of 
this type of description to linguistic 
material, see [1], [2], and from a somewhat 
different point of view, [4]. 


9. It is not difficult to give a rigorous
definition of the equivalence relation in
question, though this is fairly tedious.

10. The notion "is a" should actually be
relativized further to a given occurrence
of $s_i$ in S. We can define an occurrence of
$s_i$ in S as an ordered pair $(s_i, X)$, where X
is an initial substring of S, and $s_i$ is a
final substring of X. Cf. [5], p. 297.

11. Where U is the identity, as in fn. 4.

12. Notice that this sentence requires a
generalized transformation that operates on
a pair of strings with their phrase markers.
Thus we have a transformation that converts
S1, S2 of the forms NP - VP1, it - VP2,
respectively, into the string: ing⌢VP1 - VP2. It
converts S1 = "they - prove that theorem",
S2 = "it - was difficult" into "ing prove
that theorem - was difficult," which by (34)
becomes "proving that theorem was difficult."
Cf. [1], [3] for details.

Bibliography 

[1] Chomsky, N., The Logical Structure of
Linguistic Theory (mimeographed).

[2] Chomsky, N., Syntactic Structures, to be
published by Mouton & Co., 's-Gravenhage,
Netherlands.

[3] Chomsky, N., Transformational Analysis, Ph.D.
Dissertation, University of Pennsylvania,
June, 1955.

[4] Harris, Z.S., Discourse Analysis, Language
28,1 (1952).

[5] Quine, W.V., Mathematical Logic, revised
edition, Harvard University Press, Cambridge,
1951.

[6] Rosenbloom, P., Elements of Mathematical
Logic, Dover, New York, 1950.

[7] Shannon & Weaver, The Mathematical Theory of
Communication, University of Illinois Press,
Urbana, 1949.




SOME STUDIES IN THE SPEED OF VISUAL PERCEPTION 


George C. Sziklai
RCA Laboratories Division
Princeton, N. J.


Summary


Statistical studies of television
signals indicated a high degree of correlation
between successive elements, lines
and frames. A continuous test run indicated
that the normalized detail content
of actual broadcast signals runs lower
than 5 percent.


The fact that most transmitted television
scenes thus measured contained
very little detail compared to some artificial
subjects indicated that the observer's
preference has influenced the
subject to be transmitted. In order to
verify this assumption, some tests were
devised to measure the perception speed
of observers. These tests included certain
reading and character recognition
tests and finally a test consisting of
object recognition in precisely measured
periods was devised.


In order to evaluate the perception
rate in bits, a relationship between the
gestalt concept and the binary choices
was sought. Assuming a limited number of
picturable nouns to correspond to the
gestalt experience of the observer, each
object recognized was assigned a value of
ten bits (corresponding to 1024 picturable
nouns). Several series of these
tests indicated that the visual perception
speed of a normal observer is
between 30 and 50 bits per second, that
this value holds for periods of one-tenth
to two seconds, and that the first
thing observed is the center of the
picture. These values are similar to the
values obtained by Licklider, Pierce and
others for reading speeds.


Introduction


The statement that the television
signal is highly redundant and therefore
very wasteful of channel capacity has become
almost a cliche. Various types of redundancies
have been measured for different
subject matters, and indicated the
truthfulness of that statement. A
method proposed by Gouriet provides a
continuous indication of the entropy
based on the joint probability distribution
of successive elements, if an exponential
distribution is assumed. The particular
instrument, called the "detail meter,"
differentiates the signal and provides an
integrated value of the absolute value
of the derivative, obtained by full wave
rectification, in a normalized form, where
a maximum change at every picture element
is called 100% detail. This instrument
gives the maximum reading for a television
signal corresponding to vertical
bars (or a checkerboard) with the maximum
resolution of the system (325 lines) but
gives a reading of only 33% for a noise
spectrum, since the criterion of the exponential
distribution is not satisfied.
More significantly, however, when the instrument
is connected to a video program
line the reading is normally less than
3% and seldom is a reading in the order
of 6% obtained. In the laboratory
special subjects may be selected with
substantially higher readings; however,
these scenes would not be considered
pictures with entertainment value.


The detail meter thus indicated that
there is a viewer preference for low detail
or information content. Although some
work in this field analyzed so-called
typical pictures, we did consider embarking
upon the project of finding either
typical or preferable pictures. A clue
was supplied by the psychologists, particularly
by Prof. Y. LeGrand of the Sorbonne,
who performed some perception tests with
images, which were to be sketched by an
artist after a short exposure. These
tests indicated that even after exposures
of several seconds only the contours are
perceived. This is of course simply in
line with the gestalt recognition of the
subject.


At this point the problem was then
to establish a relationship between the
gestalt concept and some unit familiar to
the communication engineers, to provide a
display which exposes the subject for a
variable but precisely measured time, and
to find a suitable way for obtaining a
measure of the observation.


One may consider each letter of the 
alphabet, or each number a gestalt and 
then knowing that random letters corres- 
pond to 4.7 bits per symbol, a relation- 
ship between a class of subject and the 
gestalt may be established. The per- 
ception of random letters however cannot 
be measured too easily and they form a 
rather limited class of gestalts. Never- 
theless some reading speed tests and 
short exposure tests were performed and 
some of these results will be reported 
upon at another time. 


The Test Setup 


In order to control the exposure
period precisely, an electronic switching
system was developed to switch video signals.
The switching gear operates from
the video synchronizing pulses, switching
from one signal to another at the beginning
of a field and switching back to the
first signal again after 2, 4, 8, 16, 32, 64
or 128 fields (or correspondingly after
approx. 1/30, 1/15, 2/15, 1/4, 1/2, 1 and
2 seconds). The exposure time to the
second signal can be set beforehand and
the changeovers can be actuated when the
observers are ready by a switch.
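The correspondence between field counts and exposure times quoted above can be checked with a few lines of arithmetic. The sketch assumes 60 fields per second, which matches the statements elsewhere in the paper that one field lasts 1/60 of a second and two fields 1/30 of a second; the names `FIELDS_PER_SECOND` and `exposure` are chosen for this example.

```python
from fractions import Fraction

FIELDS_PER_SECOND = 60  # one field = 1/60 second


def exposure(fields):
    """Exact exposure time in seconds for a given number of fields."""
    return Fraction(fields, FIELDS_PER_SECOND)


for fields in (2, 4, 8, 16, 32, 64, 128):
    t = exposure(fields)
    print(f"{fields:3d} fields = {t} s (about {float(t):.3f} s)")
```

The first three values are exact (1/30, 1/15 and 2/15 second); the longer ones are 4/15, 8/15, 16/15 and 32/15 second, which the text rounds to 1/4, 1/2, 1 and 2 seconds.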


The use of a steady image, which is
interrupted by the test image, was adopted
with the idea that the reappearing steady
image does not permit the storage of the
test image on the retina. This effect
could be verified easily by using a uniform
white, gray or black field as the
steady image. In most of the tests the
picture shown in Fig. 1 was used as the
steady image.


The display device used was a TV
station monitor with a 12 inch kinescope.
The observers were asked to view the
screen as they would view a television
receiver. Generally a distance from 5 to
7 times the picture height was used. As a
preliminary experiment the slide shown in
Fig. 2 was flashed on for about twenty
observers and they were asked to recite
what they had seen as soon after being
exposed to the picture as possible, and a
stenographer took down their remarks. For
two fields of exposure (1/30 second), the
observers noticed a change but could not
tell what they saw. For an exposure of
four fields they recognized either the
horse or the house but not both. In 16
fields (approx. 1/4 sec.), they observed
two connected objects such as "a white
horse pulling something", or "house and
trees", "house with windows", and even
reported inventions such as "a pony with
a rider", "a house and a street and I
think there was a car in the street".


In connection with these experiments
it was already notable that (1) the center
of the picture was perceived first, (2) the
majority of the observers, after recognizing
one object, tried to name more by
conjecture, (3) the number of objects
recognized by the different observers for
the same periods was remarkably uniform,
and (4) a given total exposure seemed to
yield the same amount of recognition
whether it was taken in one exposure or in
two exposures, each lasting half of the
total period.


In the next experiment an attempt
was made to evaluate the perception in
bits and to eliminate any relationship
between the recognizable objects. The
slide shown in Fig. 3 was prepared and
the evaluation of each object in bits
was based on the following reasoning.
Ogden⁴ lists 200 picturable nouns among
his 1000 basic words. Four of the five
objects were among these, the pyramid
however was not. Since it would have
been a formidable job to count all the
picturable nouns in a complete dictionary,
a sampling technique was used and a
factor of 4.7 was found for the letter B
and a factor of 5.1 for the letter S,
between the basic English list and the
Webster dictionary. This provides an
estimate of 1000 picturable nouns or for
each gestalt approx. 10 bits of information
($\log_2 1024$). On the basis of this
estimate then it was possible to test the
perception speed in a bits per second
scale.


Fig. 1


Fig. 2


The procedure of the test with slide
3 was the same as with the previous slide.
The first tests with 8 fields (approx.
1/8 second) were inadequate for all observers
to recognize any of the objects.
A typical answer was, "approx. eight
small figures in three columns that
appeared to be line drawings of mythical
animals". With an exposure of 16 fields
(approx. 1/4 sec.) all observers recognized
the fish. The test with an exposure
period of 32 fields (approx. 1/2
sec.) was performed with 50 observers
including engineers, stenographers,
janitors, etc. Only one out of the 50
could name three objects. With one
exception the rest named the fish and one
more object; the one exception named the
pipe and the scissors in that order. In
all the tests with any exposure periods
(about 120 tests altogether), if an
object was recognized at all, the center
figure was the first named.


The recognition of two independent 
objects corresponds to 20 bits and since 
the exposure was 1/2 second, it corres- 
ponds to 40 bits per second perception 
speed. It may be interesting to compare 
this figure with the capacity of a tele- 
vision signal channel. The number of 
different messages that can be constructed 
with 200,000 picture elements per frame 
with 100 distinguishable brightness levels 
is $10^{400{,}000}$. On the basis of the 
relationship 

$$2^{x} = 10^{400{,}000}, \qquad x = \frac{400{,}000}{\log_{10} 2},$$

this corresponds to approx. 1.3 million 
bits per frame or 40 million bits per 
second. 


Fig. 3 
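
The comparison above can be checked with a few lines of arithmetic. This is only a sketch: the rate of 30 frames per second is an assumption (two 1/60-second fields per frame), and the variable names are illustrative rather than taken from the paper.

```python
import math

# Figures quoted in the text.
picture_elements = 200_000      # picture elements per frame
brightness_levels = 100         # distinguishable levels per element
recognized_bits = 20            # two objects at about 10 bits each
exposure_seconds = 0.5          # 32 fields, approx. 1/2 second

# Assumption: 30 frames per second (two interlaced 1/60 s fields per frame).
frames_per_second = 30

# log2(100 ** 200_000) = 200_000 * log2(100)
bits_per_frame = picture_elements * math.log2(brightness_levels)
channel_bits_per_second = bits_per_frame * frames_per_second
perception_bits_per_second = recognized_bits / exposure_seconds

print(f"channel:    {bits_per_frame / 1e6:.2f} million bits per frame")
print(f"channel:    {channel_bits_per_second / 1e6:.1f} million bits per second")
print(f"perception: {perception_bits_per_second:.0f} bits per second")
```

Under these assumptions the channel figure exceeds the measured perception speed by roughly a factor of one million.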


It was felt that once the technique 
described above was developed, the per- 
ception speed should be verified with 
gestalts of smaller classes, and there- 
fore shorter periods. One experiment 
with a slide which had objects with names 
all starting with the letter B failed to 
produce an object recognition rate 
higher than two per 32 fields after the 
observers were instructed about the 
restriction. On discussing the observa- 
tion with some of the viewers, it was 
evident that this type of restriction 
imposes a difficult mental process, which 
apparently causes a slowing down in the 
perception speed. 


Another experiment, in which the 
observer was shown the chart of 16 symbols 
shown in Fig. 4 beforehand and then asked 
to recognize one of these symbols, yield- 
ed excellent consistency between the tests 
and the results obtained with the general 
class of picturable nouns. When the single 
symbol (corresponding to four bits) was 
flashed for 4 fields (1/15 of a second), 
a demand of 60 bits per second, only one out 
of twenty observers guessed the correct 
symbol, which might even be accounted for 
on the basis of chance. When the ex- 
posure period was doubled (8 fields, 
2/15 sec.) and therefore the perception 
rate demand was reduced to 30 bits per 
second, only one of another batch of 20 
observers failed to recognize the right 
symbol. 


Fig. 4 


The importance of the steady picture 
before and after the exposure of the un- 
known picture was demonstrated easily by 
the use of the symbol test. While with 
the steady picture, the symbol could not 
be recognized in 8 fields; if the exposure 
to the symbol was followed by a blank 
screen, even one field (1/60 of a second) 
was enough to pick the right symbol. 
Based on these tests it is believed that, 
in the type of test where an image is 
flashed on and off by optical means, 
retinal retention is tested rather than 
the perception speed. 


One further test involved the 
instruction of the observer that the 
symbol to be shown would be one of two of 
the symbols from Fig. 4, thus each 
recognition corresponding to 1 bit of 
information received correctly. The 
exposure for this test was set at 1/30 of a 
second and we expected to perform a 
large number of tests in order to get an 
average; however, the 30 bits per second 
provided perfect recognition every time 
and accordingly ten tests were considered 
conclusive. 
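
The rate demands quoted for the symbol tests follow from dividing the bits per symbol by the exposure time. The sketch below assumes one field lasts 1/60 of a second, as stated in the text; the function name is hypothetical.

```python
from fractions import Fraction

FIELDS_PER_SECOND = 60  # one field = 1/60 of a second, as in the text

def demand_rate(bits, fields):
    """Bits per second demanded when `bits` must be perceived in `fields` fields."""
    exposure_seconds = Fraction(fields, FIELDS_PER_SECOND)
    return Fraction(bits) / exposure_seconds

# One of 16 symbols (4 bits) shown for 4 fields, then for 8 fields.
print(demand_rate(4, 4))   # 16-symbol test, 1/15 second exposure
print(demand_rate(4, 8))   # 16-symbol test, 2/15 second exposure
# One of 2 symbols (1 bit) shown for 1/30 second, i.e. 2 fields.
print(demand_rate(1, 2))
```

Exact fractions are used so that the rates come out as whole numbers without rounding error.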


The author gratefully acknowledges 
the help of Mr. R. Staffin with the tests. 
Mr. Staffin also constructed the timing 
switch. 


References 


1. G. G. Gouriet, "A Method of Measuring 
Television Picture Detail", Electronic 
Engineering, Vol. 24, 1952, p. 308. 


2. E. R. Kretzmer, "Statistics of Television 
Signals", BSTJ, Vol. 31, 1952, p. 751. 


3. C. W. Harrison, "Experiments with Linear 
Prediction in Television", BSTJ, Vol. 31, 
1952, p. 764. 


4. C. K. Ogden, The Basic Words, Kegan Paul, 
Trench and Co., London, 1933. 


5. Webster's New Collegiate Dictionary, G. 
and C. Merriam Co., 1953. 


HUMAN MEMORY AND THE STORAGE OF INFORMATION* 


George A. Miller 
Harvard University 
Cambridge, Massachusetts 


* Preparation of this article was supported in 
part under Contracts AF 33(038)-14343, AFCRC 
TR 56-54 and Nonr 1866(15) (Project NR142-201, 
Report PNR-185). Reproduction for any purpose 
of the U. S. Government is permitted. 


Abstract 


The amount of selective information in a 
message can be increased either by increasing 
the variety of the symbols from which it is 
composed or by increasing the length of the 
message. Psychological experiments indicate 
that the variety of the symbols is far less im- 
portant than the length of the message in con- 
trolling what human subjects are able to remem- 
ber. Two messages equal in length but differing 
in the amount of information per symbol are 
equally easy to memorize. This fact provides 
an opportunity for the effective use of recoding 
procedures and reveals the mental economy in- 
volved in organizing the materials we want to 
remember. 

An apparent exception to the rule that length, 
not variety, is the limiting factor in human 
memory occurs in the case of redundant mes- 
sages. If two messages of the same length differ 
because one contains redundancy familiar to the 
learner and the other does not, the redundant 
message will usually be easier to learn and 
remember. In terms of the theory of informa- 
tion, redundancy can be viewed equally well as 
a reduction in the information per symbol or 
as a reduction in the effective length of the 
message. Psychologically, however, these two 
alternatives are not equivalent; redundancy per- 
mits a reorganization into familiar sequences 
in a way that effectively shortens the length of 
the message and so makes it easier to memo- 
rize, but this is not psychologically equivalent 
to reducing the amount of information per 
symbol. 

It is as if each storage register could accept 
any one of a tremendous variety of alternative 
symbols, but the number of registers available 
was quite limited. If we use these registers to 
store binary symbols, the storage is inefficient. 
If we group the binary symbols into sequences, 
give each sequence a different name, and store 
the recoded names, we can make much more 
efficient use of the registers. Familiar redun- 
dancy is helpful because it enables us to recode 
more efficiently. 

These results for human memory are all the 
more striking in view of the fact that the amount 
of information per symbol is a critically impor- 
tant variable controlling the accuracy of our 
perceptions. 


Introduction 


The development of a mathematical theory of 
communication has stimulated considerable in- 
terest among experimental psychologists in the 
measurement of human capacities to process 
selective information (8). If a human operator 
is regarded as a communication channel with 
stimuli for inputs and responses for outputs, it 
is possible to estimate maximum rates of trans- 
mission through him. The amount of input in- 
formation can be varied either by increasing 
the rate of presentation of the stimuli or by in- 
creasing the amount of information per stimulus. 
Or, if a human operator is looked upon as a 
storage device, one can determine the conditions 
under which he learns or remembers the larg- 
est amounts of selective information. Most of 
these applications have used the discrete model 
developed by Shannon (13), since psychologists 
tend to regard stimuli and responses as discrete 
events; however, the continuous case can be 
used in the analysis of tracking behavior. 


Rates of Transmission 


Studies of the maximum rate of transmission 
yield values so low that one is forced to con- 
clude that the human nervous system was not 
designed with transmission as its major objec- 
tive. The practical upper limit under optimal 
conditions seems to be around 25 bits/second, 
and 40 bits/second is as high as anyone has yet 
claimed to be able to achieve. According to 
Quastler (12, pp. 341-349), there are two factors 
that limit the maximum rate. The first arises 
from the fact that people cannot respond more 
than about nine times/second, and this is effec- 
tively reduced to five responses/second when 
discrimination is required. A second limit in- 
trudes because people cannot discriminate rap- 
idly among more than about 32 equi-probable 
alternatives without becoming confused. If the 
task exceeds either of these limits, performance 
deteriorates rapidly. In these studies Quastler 
used skilled typists and pianists and concluded 
that 25 bits/second was the practical upper limit 
for typing and playing the piano. 
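
The 25 bits/second figure is consistent with the two limits just described, on the assumption that the alternatives are equiprobable so that each response carries log2(N) bits. The following sketch is illustrative only; the function name is not Quastler's.

```python
import math

def transmission_rate(responses_per_second, alternatives):
    """Bits/second when each response selects one of `alternatives`
    equiprobable alternatives (assumption: no errors are made)."""
    return responses_per_second * math.log2(alternatives)

# Five discriminating responses per second, 32 equiprobable alternatives.
practical_limit = transmission_rate(5, 32)
print(practical_limit)
```

Five responses per second at five bits per response gives the practical upper limit quoted in the text.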

Fitts (4), using less familiar tasks like tap- 
ping alternately two targets, transferring discs 
from one pin to another, and transferring pins 
from one set of holes to another, also found a 
consistent upper limit to the amount of infor- 
mation that people could generate, but the limit 
was between 10 and 12 bits/second. The differ- 
ence between Quastler's limit of 25 bits/second 
and Fitts’ limit of 12 bits/second is due to the 
fact that Quastler's tasks used both arm and 
finger movements, whereas Fitts’ tasks involved 
only arm movements. Still different limits would 
probably result for other muscle groups. Such 
differences indicate that we must be careful to 
specify what segment of the total organism is 
involved in producing the responses before our 
statements of channel capacity are meaningful. 

Since the muscles are physically capable of 
responding much faster than ten contractions/ 
second, the limits observed in these experiments 
are presumably attributable to the fact that ex- 
citation and inhibition cannot be made to alter- 
nate rapidly in the central nervous system with- 
out losing precise control of the movement. The 
observed trading relation between speed and 
accuracy which results in the constancy of the 
informational measure is the result of the time 
taken for central organizing processes to occur. 

The inference that the bottleneck occurs cen- 
trally is supported by the fact that the rate of 
transmission of selective information is greatly 
reduced if there is an unnatural or unfamiliar 
relation between the stimulus and the response. 
For example, Fitts and Deininger (5) found that 
if the response was a movement toward a target 
light at the instant the light appeared, perform- 
ance was far superior to the case in which an 
arbitrary direction of movement was assigned 
for each target light. This superiority persists 
even after extended practice (6). Presumably 
the arbitrary relation between stimulus and re- 
sponse requires an additional recoding operation 
by the organism and the more complex this 
central transformation becomes, the slower the 
rate of transmission. Thus, the limit depends 
not only upon the discriminability of the stimuli 
and upon the particular response system em- 
ployed, but also upon the degree of congruence 
between stimuli and responses. Fitts has re- 
ferred to this fact as the principle of “stimulus- 
response compatibility.” 


Span of Absolute Judgment 


The limitations imposed on our capacity to 
discriminate among different stimuli can be 
studied in more detail if we remove the re- 
quirement that the operator must respond as 
quickly as possible. That is to say, we give 
the person unlimited time in which to respond, 
but ask him to be as accurate as possible. A 
stimulus is presented and the observer is re- 
quired to identify which one of a set of alterna- 
tive stimuli it is. Under these conditions of 
judgment it appears that the complexity of the 
stimulus is an extremely important factor. If 
we take the simplest possible stimuli--pure 
tones, monochromatic lights--and vary them 
along a single dimension, subjects can identify 
accurately from 5 to 15 alternatives. For such 
unidimensional stimuli, therefore, the discrimi- 
native capacity ranges between two and four 
bits/judgment. However, if the stimuli differ 
from one another in several dimensions, dis- 
criminations can be made simultaneously along 
each dimension. The total discriminative ca- 
pacity is increased by increasing the number of 
dimensions of variation in the stimuli, but the 
accuracy of judgment on each individual dimen- 
sion decreases--there is a slight interference 
or “masking” effect in multidimensional dis- 
criminations. With sounds that differed from 
one another on six different acoustic dimensions, 
for example, Pollack and Ficks (11) obtained the 
value of 7.2 bits/judgment. This is a large in- 
crease over the three bits/judgment obtained 
with unidimensional stimuli, but it represents 
a relative decrease to 1.2 bits/judgment/dimen- 
sion. Apparently we are designed to operate 
best when we must make relatively crude judg- 
ments of several attributes simultaneously. We 
are not able to make extremely precise identi- 
fications along a single dimension. Miller (9) 
has referred to this limitation as defining a 
“span of absolute judgment.” 
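
The figures in this section can be related by a short calculation, assuming the identifiable alternatives are equiprobable so that N alternatives correspond to log2(N) bits/judgment. The names below are illustrative.

```python
import math

def bits_per_judgment(alternatives):
    """Information in one error-free identification among
    `alternatives` equiprobable stimuli."""
    return math.log2(alternatives)

# Unidimensional stimuli: 5 to 15 identifiable alternatives.
low = bits_per_judgment(5)
high = bits_per_judgment(15)

# Six-dimensional sounds (Pollack and Ficks): 7.2 bits/judgment in total.
total_bits = 7.2
dimensions = 6
per_dimension = total_bits / dimensions
equivalent_alternatives = 2 ** total_bits

print(f"unidimensional: {low:.2f} to {high:.2f} bits/judgment")
print(f"six dimensions: {per_dimension:.1f} bits/judgment/dimension")
print(f"7.2 bits is about {equivalent_alternatives:.0f} distinguishable sounds")
```

Each dimension therefore contributes only a little more than a binary choice, yet the combination distinguishes far more stimuli than any single dimension allows.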




This perceptual limitation is surprising when 
we remember that people can detect very small 
differences between two stimuli presented at 
the same time or in immediate succession. 
Relative judgments are far easier than absolute 
identifications. The limit does not seem to re- 
side peripherally in the receptor organs, but in 
a central process of judgment that is involved 
when we attempt to identify or name a partic- 
ular, individual stimulus. We are far more ac- 
curate in saying that two stimuli are different 
than we are in saying which particular stimuli 
they happen to be. Thus we are once again, as 
in the case of the muscles, forced to look to 
the central nervous system for the cause of our 
limited capacity to process information. 


Span of Immediate Memory 


Once we have removed the instruction that 
the person must respond as rapidly as possible, 
it is quite natural to go one step further and 
ask the person to withhold his response until 
several stimuli have been presented in succes- 
sion. Thus we move from measuring bits/sec- 
ond to bits/judgment and on to bits/sequence. 
To discrimination and response, we add the task 
of storage. The limitations of memory, however, 
appear to be a different sort than the limitations 
of discrimination or response. 

In the simplest test of mnemonic capacity, a 
sequence of symbols (usually decimal digits) is 
read aloud or shown to the person at a regular 
rate (usually one per second) and at the end of 
the sequence he is asked to repeat or write the 
symbols in the correct order. The experiment- 
er begins with short sequences and increases 
the length until the person is no longer able to 
repeat the entire sequence without error. This 
point is called the “span of immediate memory.” 
The procedure was adopted by Binet in his first 
scale for measuring mental age and has been 
retained in all subsequent revisions of the scale. 
The memory span is not a perfect measure of 
intelligence, however, since a long span does 
not necessarily indicate high intelligence. It 
has been retained because an unusually short 
span is a reliable indicator of mental deficiency. 

Inasmuch as the amount of information re- 
quired to specify the stimulus or to control the 
response sets the perceptual and motor limits, 
it is natural to ask whether the span of imme- 
diate memory shows a similar invariance when 
information measures are applied. The answer 
is unequivocally negative. It is the length of the 
sequence, rather than the amount of information 
per item, that is the critical factor. For ex- 
ample, a person who can repeat nine binary 
digits will have a span of about eight decimal 
digits, seven letters of the alphabet, or five 
monosyllabic English words. These represent 
9, 25, 33, and about 50 bits, respectively. 
Clearly, the memory span is more nearly in- 
variant when we measure it in terms of the 
length of the sequence than when we measure 
it in terms of the amount of information stored. 
Whereas perception and response are limited by 
the number of bits of information, immediate 
memory is limited by the number of items or 
“chunks” of information (9). 
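
The contrast between items and bits can be sketched numerically. The calculation assumes equiprobable symbols, and the vocabulary of 1000 monosyllabic words is a hypothetical figure chosen only because it gives about 10 bits per word; under these assumptions the results agree with the quoted values to within a bit or two.

```python
import math

# (material, span in items, number of alternatives per item)
# The 1000-word vocabulary is an assumption, not a figure from the text.
spans = [
    ("binary digits", 9, 2),
    ("decimal digits", 8, 10),
    ("letters", 7, 26),
    ("monosyllabic words", 5, 1000),
]

def stored_bits(items, alternatives):
    """Information in a sequence of `items` equiprobable symbols."""
    return items * math.log2(alternatives)

for material, items, alternatives in spans:
    bits = stored_bits(items, alternatives)
    print(f"{material:20s} {items} items  about {bits:.0f} bits")
```

The span varies by less than a factor of two in items while the stored information varies by more than a factor of five.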

This fact leads to an insight into the eco- 
nomics of cognitive organization. Since it is as 
easy to remember a lot of information (when 
the items are informationally rich) as it is to 
remember a little information (when the items 
are informationally impoverished), it is econom- 
ical to organize the material into rich chunks. 
To draw a rather farfetched analogy, it is as 
if we had to carry all our money in a purse 
that could contain only seven coins. It doesn’t 
matter to the purse, however, whether these 
coins are pennies or silver dollars. The pro- 
cess of organizing and reorganizing is a per- 
vasive human trait, and it is motivated, at 
least in part, by an attempt to make the best 
possible use of our mnemonic capacity. 


Recoding 


The effect of reorganizing, or recoding, the 
input can be illustrated by a trick that computer 
engineers use to remember a long sequence of 
binary digits. The sequence of binary digits is 
first grouped into successive triplets and then 
each one of the eight possible triplets is trans- 
lated into a single octal digit. For example the 
sequence 010111001001000110 is grouped 010- 
111-001-001-000-110 and recoded into 271106. 
The original 18 binary digits far exceed the span 
of immediate memory, but the six recoded octal 
digits are easily remembered. After a little 
study of the binary-to-octal transformation, the 
engineers are able to deal with almost three 
times as much information as before. 
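
The binary-to-octal trick can be written out in a few lines. This is a minimal sketch; the function name is illustrative, and sequences whose length is not a multiple of three are simply rejected.

```python
def recode_binary_to_octal(bits):
    """Group a string of binary digits into triplets and name each
    triplet with a single octal digit."""
    if len(bits) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    triplets = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    return "".join(str(int(triplet, 2)) for triplet in triplets)

sequence = "010111001001000110"
recoded = recode_binary_to_octal(sequence)
print(len(sequence), "binary digits ->", len(recoded), "octal digits:", recoded)
```

Eighteen items become six, which brings the sequence within the span of immediate memory without losing any information.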

A great deal of our learning concerns the 
development of these rules for reorganizing the 
input information. For instance, when a man 
first begins to receive Morse Code, each dit 
and dah seems to be an isolated item of infor- 
mation and he gets lost if he falls more than 
two or three letters (five to ten dits and dahs) 
behind the transmitter. As he learns to recode 
the dits and dahs into letters, then words, and 


later phrases, he is able to deal with larger 
and larger chunks of the message. An experi- 
enced operator may sometimes fall as much as 
ten words behind a familiar message without 
becoming lost; ten words represent about 150 
dits and dahs. The experienced operator is able 
to store that sequence of 150 items away in 
memory because he has organized them into fa- 
miliar groups in much the same way computer 
engineers organize binary into octal digits. 

It is possible to think of many of the great 
advances in human thought as discoveries of 
more economical ways to package information 
that must be stored in the mnemonic warehouse. 
The superiority of the Arabic over the Roman 
notation for numbers is a case in point, and 
many advances in mathematics--the calculus, 
matrix algebra, statistical theory, and probably 
many others--can be considered as providing a 
mnemonically better set of symbols for repre- 
senting the relevant aspects of a problem in 
such a way that we can grasp them in a single 
act of thought. It would be foolish to argue that 
mental economy was the only result of mathe- 
matical inventions, but certainly the discovery 
of a good system of notation is an important 
step toward new insight. It is interesting to 
speculate whether the construction of large com- 
puting machines with enough storage to make 
such recoding unnecessary will have any impor- 
tant effect upon mathematical creativity. Our 
growing capacity to solve problems by electronic 
computation before we have properly understood 
them has disturbed many pure mathematicians. 

It is also possible to argue that the value of 
natural laws is at least in part due to the fact 
that they summarize in a convenient formula a 
tremendous amount of information collected from 
individual observations in numerous experiments. 
The law of gravitation or the gas laws, for ex- 
ample, might just as well be summarized in 
tables with the experimental measurements tabu- 
lated under the appropriate experimental condi- 
tions; such tables would have the tremendous 
disadvantage that, although they contain exactly 
the same information, we could not apprehend 
simultaneously all that they have to tell us. 

A less precise but more general technique 
for organizing our experience into convenient 
units is provided by language (7, pp. 223-237). 
When you witness a scene or hear a story that 
you want to remember, you try to translate it 
“into your own words,” into the linguistic units 
that will fit into your own cognitive hierarchy. 
This highly schematic, verbalized abbreviation 
is remembered. Then when you try to recall 
you must decode. Since the fit of words to 
experience is seldom as tight as the fit of laws 
to data, the decoding process often goes astray. 
You supply details by secondary elaboration 
that are consistent with your coded memory. 
Often these details are wrong. Psychologists 
have been interested in such systematic distor- 
tions, because they become of practical conse- 
quence for legal testimony and the propagation 
of rumor (1). 


Rote Memorization 


These speculations indicate some of the im- 
plications of the fact that it is length, not va- 
riety, that imposes the major restriction upon 
immediate memory. Immediate memory, how- 
ever, is merely the simplest of our mnemonic 
functions. What happens when the amount of ma- 
terial that we must deal with greatly exceeds 
the span of immediate memory? 

One way that psychologists have studied how 
people assimilate large quantities of material is 
to ask them to memorize it and to record how 
much of the material they have mastered after 
each rehearsal. This procedure has the merit 
that it provides a reasonably objective quantifi- 
cation of the progress of learning. It can be 
used either with meaningful, connected text or 
with nonsensical, disconnected sequences. We 
shall consider both cases, taking first the sim- 
pler, though less natural, case in which the ma- 
terial to be learned consists of a sequence of 
symbols chosen independently from a given set 
of alternatives. 

Dr. Sidney L. Smith and I asked subjects to 
memorize long sequences of items in which each 
item was chosen from among either 2, 8, or 32 
alternatives (0 or 1, 0 through 7, or all the let- 
ters of the alphabet except Q plus the numerals 
3 through 9). Lists containing 10, 20, 30, and 
50 successive items were constructed for all 
three kinds of test material. Each list was 
presented to the learner a symbol at a time at 
a rate of one symbol/second and after the com- 
plete presentation he wrote down in the correct 
order as much of the list as he could remember. 
The same list was presented repeatedly until 
he was able to reproduce it all without error. 
Six people learned the entire set of lists. 

The lists were constructed in such a way that 
each item contained 1, 3, or 5 bits of informa- 
tion. If the difficulty of the task depends upon 
the amount of information involved, it should 
take just as long to memorize a sequence of 
ten items selected from 32 alternatives as it 
takes to memorize a list of 50 items selected 


from two alternatives. The averages of the num- 
ber of trials required in these two cases, however, 
were 2.5 and 12.2, respectively, a very reliable 
difference. Thus we can safely say that, even 
when the material to be remembered exceeds the 
span of immediate memory, the difficulty of the 
task does not depend upon the amount of infor- 
mation that the material contains. 
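
The design of this comparison can be made explicit. The sketch below builds the 32-symbol alphabet described above and computes the information in the two critical lists; the mean trial counts are copied from the text, and all names are illustrative.

```python
import math
import string

# 32 alternatives: the alphabet without Q, plus the numerals 3 through 9.
alphabet_32 = [c for c in string.ascii_uppercase if c != "Q"] + \
              [str(d) for d in range(3, 10)]

def list_information(length, alternatives):
    """Bits in a list of independently chosen, equiprobable items."""
    return length * math.log2(alternatives)

short_rich = list_information(10, len(alphabet_32))  # 10 items at 5 bits
long_poor = list_information(50, 2)                  # 50 items at 1 bit

# Mean trials to learn each list, as reported in the text.
mean_trials = {"10 items, 32 alternatives": 2.5,
               "50 items, 2 alternatives": 12.2}

print(short_rich, long_poor)
for condition, trials in mean_trials.items():
    print(f"{condition}: {trials} trials")
```

Both lists carry the same amount of information, so an information-limited memory would predict equal difficulty; the reported trial counts differ by nearly a factor of five.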

If the difficulty of the task depends upon the 
number of items involved, it should take just as 
long to memorize a sequence of N items selec- 
ted from 32 alternatives as from eight or two 
alternatives. This prediction is much closer to 
the facts. The number of repetitions required 
to memorize the 32-alternative and 8-alternative 
lists were not significantly different for any of 
the lengths of list used. The sequence of binary 
items, however, was slightly easier and required 
about 20% fewer repetitions; this discrepancy 
is caused presumably by the fact that binary 
sequences are especially easy to group and re- 
code. In general, however, even when the ma- 
terial to be remembered exceeds the span of 
immediate memory, the difficulty of the task de- 
pends critically upon the length of the list. 

To summarize, therefore, we may say that 
when the material to be learned does not form 
a familiar sequence, the difficulty depends pri- 
marily upon the length of the material and is 
relatively independent of the amount of informa- 
tion it contains. Under these conditions, it is 
just as easy to memorize a lot of information 
as to memorize little information. And this 
conclusion holds both for the span of immediate 
memory and for materials that exceed the span 
and must be repeated several times. Our con- 
fidence in this generalization is supported by 
the fact that Brogden and Schmidt (2, 3), working 
with quite different techniques and unaware of 
the hypothesis Smith and I were trying to test, 
obtained very similar results. 


The Unitization Hypothesis 


The fact that there is a limited span of im- 
mediate memory poses a paradox for psycholo- 
gists; one might almost call it the central 
issue for studies of verbal learning. If we can 
remember seven items without any trouble, why 
can’t we simply hold those seven and take on 
seven more? What is it about the second seven 
that makes us forget the first seven? A variety 
of explanations have been proposed. “Leaky 
bucket” hypotheses hold that a point is reached 
at which the old material leaks out as fast as 
the new is put in; that the memory traces fade 
away until we start to forget as fast as we 
learn. “Cross talk” hypotheses hold that a 
point is reached at which the separate signals 
begin to interfere with one another; that we have 
competing tendencies to respond with different 
items instead of the correct item. “Sabotage” 
hypotheses hold that a point is reached at which 
new items begin to wreck the established mne- 
monic machinery; that each item sets up an 
active inhibitory process that accumulates until 
it exceeds a critical level. “Standing room only” 
hypotheses hold that there are only so many 
seats available in the mental amphitheater; that 
there is not time to organize the material prop- 
erly into supraordinate units in order to fit it 
into the available number of slots. 

Each of these several varieties of hypotheses 
has implications for a wide range of phenomena 
that have been observed in studies of verbal 
learning. This is not the place to develop them 
or to compare their relative merits. They are 
mentioned here only to indicate that there is 
some divergence of opinion among psychologists 
and that the opinions of this author are not 
likely to represent the final word on this com- 
plex topic. The present exposition of the 
unitization, or “standing room only,” hypothesis, 
therefore, should be read with only one eye. 

Suppose we take quite literally the assumption 
that our memory is capable of dealing with 
only seven items at a time. It is as if we were 
dealing with a computing machine that has a 
small, fixed number of storage registers. Each 
register can accept any of a tremendous variety 
of different symbols, so the total amount of in- 
formation that can be stored is quite large. The 
design of this storage system, however, makes 
it necessary to recode the input in order to re- 
duce it to a small enough number of symbols. 
When the number of items in the input exceeds 
the number of registers, therefore, the items 
must be grouped according to some scheme of 
organization, new symbols must be chosen to 
represent the new groups, and these recoded 
symbols are then stored in the registers. The 
learning process consists largely of reorganizing 
and symbolizing the input until it is reorganized 
into supraordinate units sufficiently simple to 
fit into the machine. It is this grouping and 
naming that we shall call “unitization.” 
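
The register analogy can be sketched as a toy program. This is not a model proposed in the text: the seven registers are taken literally, the fixed group size of three is an arbitrary assumption, and a tuple stands in for the new symbol chosen to name each group.

```python
def unitize(items, registers=7, group_size=3):
    """Regroup items into supraordinate units until the number of
    top-level units fits into the available registers."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    units = list(items)
    while len(units) > registers:
        # Each tuple plays the role of one recoded symbol.
        units = [tuple(units[i:i + group_size])
                 for i in range(0, len(units), group_size)]
    return units

def expand(unit):
    """Decode a unit back into the original items (items must not be tuples)."""
    if isinstance(unit, tuple):
        return [item for sub in unit for item in expand(sub)]
    return [unit]

message = list("010111001001000110")
stored = unitize(message)
recalled = [item for unit in stored for item in expand(unit)]
print(len(message), "items stored as", len(stored), "units")
```

Recalling the few units at the top of the hierarchy and expanding them recovers every original item, which is the property the hypothesis requires of a good unitization.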

When a person submits himself to a psy- 
chologist who asks him to memorize some stupid 
and useless sequence of symbols, he probably 
unitizes the material in an ad hoc manner that 
is quite tentative and transient, but is adequate 
for the immediate purposes. When he sets out 
to learn something that he is personally inter- 
ested in and that he expects to have use for, 
however, he is probably much more careful to 
organize the material in a way that fits well into 
his established cognitive structure. Without the 
pressure of time, he can explore various alter- 
native unitizations until he finds one that works 
best for him and promises the best recall at 
any later date. In either case, however, his 
task is to create a hierarchy of units in such a 
way that by recalling the few, informationally 
rich and suggestive units at the top of the hier- 
archy he can then recover the more numerous, 
more detailed items at the bottom. 

The importance of this organizing process is 
introspectively obvious, but it is quite difficult 
to get at experimentally. The behavioral system 
that shows the clearest hierarchical organization 
of small units into larger, supraordinate units 
is, of course, language. Ever since Plato ob- 
served that thought is the soul's discourse with 
itself we have been aware of the intimate relation 
between thinking and talking. Although we are 
far from understanding the precise nature of 
this relation, it does seem reasonable to assume 
that the hierarchical organization of language-- 
sounds or letters, syllables, words, phrases, 
clauses, sentences, paragraphs--is not an acci- 
dental pattern, but truly represents the preferred 
mode of operation of our mental machinery. 
This is not to suggest that the laws of grammar 
are the laws of thought; the fallacy of this argu- 
ment is exposed by the variety of grammatical 
laws in different languages. But the existence 
of some kind of hierarchical organization in all 
languages must not be ignored. The identifica- 
tion of the appropriate units would be as sig- 
nificant for psychology as the molecular theory 
was for physics and chemistry or the cellular 
theory for biology. 

Before the “standing room only” hypothesis 
can be of much value to us, however, we must 
somehow derive from it quantitative predictions 
that can be tested by experiments which psy- 
chologists are able to conduct. By way of illus- 
tration, we might develop some of the following 
as theorems from a basic postulate: (1) Any- 
thing that interferes with the recoding process 
will interfere with memory. (2) Since the rate 
of presentation of the material can be too rapid 
to permit organization to occur, we would pre- 
dict that a slow rate of presentation would lead 
to learning in fewer repetitions than would a 
fast rate. Further (3) the superiority of the 
slower rate would tend to disappear as the ma- 
terial became more meaningful, since the or- 
ganization into units is easier with familiar text 
than with random sequences of symbols. Also 
(4) since connected discourse fits into a scheme 
of organization that we have already learned, it 
should be much easier to learn than the same 
amount of nonsense. (5) The amount of organizing 
required increases as the length of the material 
increases, other things being equal, and thus 
long passages are harder to learn than short 
ones. (6) The process of organizing should be- 
gin at some clear focus or reference point in 
the material, usually the beginning or the end, 
so that the middle of the sequence should be 
learned last. (7) The particular confusions and 
mistakes that occur should be predictable from 
a knowledge of the method of recoding that was 
used. (8) The learner's expectations will influ- 
ence the way he organizes, so that tests of re- 
tention which violate his expectations should 
lead to poor performance. All of these predic- 
tions are supported by experimental evidence, 
but that fact is not decisive, because they can 
also be accounted for in terms of other theories. 


Meaningful Materials 


A major difficulty encountered in any attempt 
to apply the unitization hypothesis to particular 
experimental data is the specification of the 
supraordinate units that the learners are using. 
It is fine to know that it is length rather than 
variety, chunks rather than bits, that limits our 
memories. But this conclusion would be more 
useful if we had a better way to recognize what 
size chunks were used. For example, we can 
usually repeat a 20-word sentence after hearing 
it once. How many items--100 letters, 30 syl- 
lables, 20 words, 6 phrases, 2 clauses, or one 
sentence--does this sentence contain? We know 
that it contains about 120 bits of information, 
because we have a definition of the bit that is 
independent of our subjective organization of the 
sentence. But the essence of the chunk is that 
it is imposed by the person. For example, 
someone who knew nothing of English except the 
alphabet would have to treat the sentence as if 
it were 100 units long, whereas someone who 
knows English well might deal with it as if it 
were six units long. We cannot define the unit 
of organization independently of the learner. 

One way to get at the problem of defining the 
unit is to use sequences of words constructed at 
different orders of approximation to English. 
The 0-order approximation is constructed by
selecting words at random from a dictionary,
so that each word has an equal chance to occur
in the sequence. The 1-order approximation
consists of words chosen randomly from con- 
nected text, so that each word occurs according 


to its natural probability in English discourse. 
The 2-order approximation consists of words 
selected in the context of one preceding word, 
so that each successive (overlapping) pair of 
words occurs with its natural probability in 
English discourse. One way to construct such 
sequences is to select a word at random and 
search through a text until this word occurs. 
Then take the word which follows it and search 
until this word occurs again. Then take the 
word that follows it, etc., continuing in this way 
until a passage of the desired length is obtained. 
An alternative method is to ask a person to use 
the word in a sentence. Then take the word 
that follows it in his sentence and give this word
to another person to use in a sentence. Then 
take the following word in his sentence, etc., 
until the passage is completed. The 3-order 
approximation consists of words selected in the 
context of two preceding words and, in general, 
the N-order approximation contains words chosen 
in the context of N-1 preceding words. As N 
increases the sequences begin to sound more 
and more like English. The following are some 
examples: 


Betwixt trumpeter pebbly complication vigorous
tipple careen obscure attractive consequences
expedition pane unpunished prominence chest
sweetly basin awoke photographer ungrateful

Is to went biped the of before love turtledoves
the spins and I of yard than ask went Greek
yesterday

Sun was nice dormitory is I like chocolate cake
but I think that book is he wants to school there

Family was large dark animal came roaring
down the middle of my friends love books
passionately every kiss is fine

Road in the country was insane especially in
dreary rooms where they have some books
to buy for studying Greek

Said that he was afraid of dogs marked with
white spots and with black spots covering it
the leopard did

Miller and Selfridge (10) constructed 0, 1, 2,
3, 4, 5, and 7-order approximations to study
how well such materials could be recalled. At
each order of approximation, lists 10, 20, 30, and
50 words in length were used. A group of
people heard the words read aloud once in a
regular monotone and attempted to recall them
immediately afterward. As might be expected,
it was found that the higher-order approximations
were remembered best. The interesting result,
however, was that most of the improvement had
occurred by the time a 5-order approximation
was reached. The introduction of contextual
constraints extending beyond about five words
added little to facilitate recall. It is the
short-term dependencies that are most important.
A 5-order approximation is still nonsense.
Apparently it is not important that the sequence
of words have a meaning. It is sufficient that it
does not violate familiar intraverbal connections
extending over sequences of only four or five
words. On the basis of this evidence, therefore,
we might argue that the natural size unit for
dealing with connected text averages about five
words in length.

Miller and Selfridge also included randomly
selected segments of connected texts in their
tests and found only a slight, insignificant
difference in recall for 5-order, 7-order, and
textual materials. Apparently this result was
due to an unfortunate selection of textual
materials, for other experimenters have found
connected text slightly easier to learn. It seems
that meaning does add something to recall that
is not contained in the nonsensical
approximations. This fact was clearly demonstrated
by Marvin Levine in an unpublished study. Levine
used short anecdotes written to have exactly
the same length as the other passages. The
meaningful unity of the anecdote produced recall
scores about 10% higher than he obtained with
7-order approximations. Thus larger units can
play a significant role in recall. Nevertheless,
the fact remains that it is the short-term
connections that produce the major effects.
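
The text-search procedure described above for building an N-order approximation can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure Miller and Selfridge actually ran: the function name `n_order_approximation`, the use of the text's own vocabulary as a stand-in for "a dictionary" in the 0-order case, and the re-seeding rule used when a context occurs only at the very end of the text are all assumptions made here.

```python
import random


def n_order_approximation(words, order, length, rng=None):
    """Return `length` words forming an `order`-order approximation to `words`.

    `words` is a list of tokens from connected text (hypothetical input).
    """
    rng = rng or random.Random()
    if order == 0:
        # 0-order: every distinct word equally likely (vocabulary stands in
        # for the dictionary mentioned in the text).
        vocab = sorted(set(words))
        return [rng.choice(vocab) for _ in range(length)]

    k = order - 1  # number of preceding context words
    # Index every k-word context by the words that follow it in the text.
    followers = {}
    for i in range(len(words) - k):
        followers.setdefault(tuple(words[i:i + k]), []).append(words[i + k])

    # Seed the passage with k consecutive words taken from a random position.
    start = rng.randrange(len(words) - k) if k else 0
    out = list(words[start:start + k])
    while len(out) < length:
        context = tuple(out[-k:]) if k else ()
        choices = followers.get(context)
        if not choices:
            # Assumption: if the context has no recorded follower, fall back
            # to a word drawn at random from the text.
            choices = words
        out.append(rng.choice(choices))
    return out[:length]
```

Choosing uniformly among all recorded followers of a context is equivalent to searching the text from a random starting point, so each continuation appears with its relative frequency in the source text.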


Redundancy 


If we view the results of the Miller-Selfridge 
experiment in the light of the preceding discus- 
sion, they seem at first glance to violate our 
general thesis that it is length, not variety, that 
makes a sequence of symbols difficult to memorize.
The 0-order approximation is less redundant and
so contains more information per word
than does the 7-order approximation, and it is 
correspondingly more difficult to remember. 
Although the data do not permit a precise cal- 
culation, it is approximately true that the same 
amount of information was recalled with 50-word 
passages independent of the order of approxima- 
tion to English. If these were the only data 
available, therefore, we might conclude that it 
is the amount of information, not the length, 
that is the critical variable. 

However, the Miller-Selfridge results do not
give a decisive answer to this question. It is
indeed true that the 0-order approximation
contains more information per word, but it is also
true that it is psychologically longer than the
same number of words in a 7-order approximation.
That is to say, each word in a 0-order
approximation is a separate unit, whereas
successive words in a 7-order approximation can
be grouped into familiar, supraordinate units.
Thus the experiment confounds the two variables,
amount of information and number of units, that
we are attempting to separate. On the basis of
these data alone we cannot say conclusively
whether it is the greater number of units or the
greater number of bits that makes the 0-order
approximation hard to recall.

Shannon (14) has estimated that English is 
about 75% redundant. This redundancy provides 
a margin of safety for our perceptual and motor 
processes. It enables us to mis-speak or to
mis-hear the details of a message and to correct
our errors on the basis of context. For the
purposes of remembering, however, this
redundancy would seem to be very inefficient. 
If it is length that burdens our storage facilities, 
why do we deliberately make our messages four 
times as long as would be necessary if we used 
our alphabet with maximum efficiency? 
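
As a rough consistency check on these figures, the 20-word sentence discussed earlier (about 100 letters and about 120 bits) can be set against an alphabet used at maximum efficiency. The 27-symbol alphabet (26 letters plus the space) is an assumption introduced here for illustration; the text does not state the alphabet size.

$$H \approx \frac{120\ \text{bits}}{100\ \text{letters}} = 1.2\ \text{bits per letter}, \qquad H_{\max} = \log_2 27 \approx 4.75\ \text{bits per letter},$$

$$R = 1 - \frac{H}{H_{\max}} \approx 1 - \frac{1.2}{4.75} \approx 0.75, \qquad \frac{1}{1 - R} \approx 4 .$$

Under that assumption the two numbers quoted in the text agree: a redundancy of about 75% corresponds to messages roughly four times as long as a maximally efficient encoding would require.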

This question springs from an easy confusion 
between information theory and psychological 
theory. Within the framework of information 
theory it is true that redundancy can be treated 
either as (1) a reduction in the information per 
symbol, or (2) a reduction in the effective length
of the message. Psychologically, however, these
two alternatives are not equivalent; the fact
that the message is relatively predictable
enables a man to organize it into supraordinate 
units, and it is these organized units, corres- 
ponding roughly to what we call “ideas,” that 
he stores away in memory. Information theory 
does not say that 75% of our ideas are redun- 
dant (though this may indeed be the case), but 
only that the same ideas could be encoded in 
25% as many letters. In terms of psychological
units, therefore, our messages are not four
times as long as necessary to the person familiar
with the organization of the language.

On the basis of the available evidence it 
seems that memory, unlike sensory-motor 
processes, is not limited primarily by the 
amount of information it must process. Memory 
is limited by the number of psychological units 
it must handle; under appropriate recoding 
transformations these units can come to repre- 
sent large amounts of information. In one 
sense, this is a disappointing conclusion. Re- 
search would be easier if the well-defined bit 
could be used to measure the capacity of the 
memory. Unfortunately for psychologists, the 
human organism was not designed for the con- 
venience of researchers. The problem of de- 
fining the size of the chunk of information that 
people treat as a unit has not been solved by
information theory, and the psychologist's task 
remains as difficult as ever. But on the com- 
forting side of the picture, information theory 
has helped to clarify the purposes behind our 
persistent classifying and unitizing of experience. 
It is informationally profitable, both for com- 
munication and memory, to organize and sym- 
bolize. For this reason, the measurement of 
selective information has become an important 
tool in psychological research. 




References 


1. Allport, G. W., and L. J. Postman. The Psychology of Rumor.
   New York: Holt, 1947.

2. Brogden, W. J., and R. E. Schmidt. The effect of number of
   choices per unit of a verbal maze on learning and serial
   position errors. J. exp. Psychol., 1954, 47, 235-240.

3. Brogden, W. J., and R. E. Schmidt. Acquisition of a 24-unit
   verbal maze as a function of the number of alternative
   choices per unit. J. exp. Psychol., 1954, 48, 335-338.

4. Fitts, P. M. The information capacity of the human motor
   system in controlling the amplitude of movement.
   J. exp. Psychol., 1954, 47, 381-391.

5. Fitts, P. M., and R. L. Deininger. S-R compatibility:
   Correspondence among paired elements within stimulus and
   response codes. J. exp. Psychol., 1954, 48, 483-492.

6. Fitts, P. M., and C. M. Seeger. S-R compatibility: Spatial
   characteristics of stimulus and response codes.
   J. exp. Psychol., 1953, 46, 199-210.

7. Miller, G. A. Language and Communication. New York:
   McGraw-Hill, 1951.

8. Miller, G. A. What is information measurement?
   Amer. Psychologist, 1953, 8, 3-11.

9. Miller, G. A. The magical number seven, plus or minus two:
   Some limits on our capacity for processing information.
   Psychol. Rev., 1956, 63, 81-97.

10. Miller, G. A., and J. A. Selfridge. Verbal context and the
    recall of meaningful material. Amer. J. Psychol., 1950, 63,
    176-185.

11. Pollack, I., and L. Ficks. Information of elementary
    multi-dimensional auditory displays. J. acoust. Soc. Amer.,
    1954, 26, 155-158.

12. Quastler, H. (Ed.) Information Theory in Psychology.
    Glencoe: Free Press, 1955.

13. Shannon, C. E. A mathematical theory of communication.
    Bell Syst. Tech. J., 1948, 27, 379-423.

14. Shannon, C. E. Prediction and entropy of printed English.
    Bell Syst. Tech. J., 1951, 30, 50-64.


THE HUMAN USE OF INFORMATION 


III.


DECISION-MAKING IN SIGNAL DETECTION AND RECOGNITION
SITUATIONS INVOLVING MULTIPLE ALTERNATIVES*


John A. Swets and Theodore G. Birdsall 
University of Michigan 


Summary 


A general theory of signal detectability, 
constructed after the model provided by decision 
theory, is applied to the performance of the human 
observer faced with the problem of choosing among 
multiple signal alternatives on the basis of a 
fixed, finite observation interval. An extension 
of the theory, previously reported, of deciding 
among two alternatives, is developed in detail, 
permitting the treatment of the two simple cases 
involving multiple alternatives that are studied 
experimentally. 


In both cases, a priori probabilities are
assigned to the occurrence of the relevant signal 
alternatives, and values are assigned to the poss- 
ible decision outcomes. This assignment permits 
definition of the expected value of a decision, 
and specifies as the optimum decision criteria 
those that maximize expected value. The experi- 
ments are primarily concerned with determining 
the ability of the human observer to successively 
establish optimum decision criteria in accordance 
with changes in the a priori probabilities and 
risk functions. 


The experimental results are portrayed in the 
form of comparisons of the payoff obtained by the 
observer and the theoretically maximum payoff
attainable, and comparisons of the response-fre- 
quency tables generated by the theory and by the 
observer. The results indicate that a highly 
simplified theory is adequate for prediction of the 
obtained payoff and response-frequency tables to 
within a few percent. They also indicate the fair- 
ly large extent to which intelligence may influ- 
ence a sensory process usually assumed to involve 
fixed parameters. 


Introduction 


Several experiments previously reported have
supported the applicability of the model provided 
by the theory of statistical decision, or the 
theory of testing statistical hypotheses, to the 
behavior of the human observer in signal detec- 
tion and recognition situations. In particular, 
a theory which assumes the human observer to be 
capable of adapting to any of a number of optimum 
decision rules has been shown experimentally to 
constitute an adequate description of his be- 
havior in signal reception problems involving two 
signal alternatives or hypotheses. 


* This work was sponsored by the U. S. Army
Signal Corps.


This paper is concerned with more complex 
situations. It reports an extension of the de- 
cision-making theory of detection in which opti- 
mum behavior is specified for two situations, 
each involving three signal hypotheses. The re- 
sults of experimental tests of the extended 
theory, employing auditory signals embedded in
noise, are also presented.


The relevance of the theory of statistical
decision to the general theory of signal
detectability has been discussed in other papers
(3,14,15). The application of the mathematical
theory of signal detectability to the behavior of
the human observer in several different detection
and recognition problems, involving both visual
and auditory signals, has also been the subject of
a series of papers (1,6,8,9,10,11,12,13). The next
section summarizes those aspects of the papers
referred to that are basic to the experiments
reported here.

In addition, the next section elaborates for the
first time the theory specific to the present
experiments. A more general treatment of the
multiple-hypothesis theory is that of Middleton
and Van Meter (15); the development of the next
section is somewhat simpler in that it deals only
with the specific three-hypothesis cases studied
experimentally.


The Decision-Making Theory of Detection 


Two assumptions are primary in the applica- 
tion of the theory of signal detectability to the 
behavior of the human observer. The first is 
that sensory systems are basically communication 
channels, conveying information about environmen- 
tal events to higher centers. It is supposed 
that, at the higher centers, this sensory infor- 
mation combines with a priori information to 
serve as the basis for decisions about the events 
initiating the sensory information. The second 
assumption is that the sensory systems are noisy 
channels, that is, that they themselves contin- 
uously generate irrelevant, random activity that 
is not easily distinguishable from the activity 
initiated externally. In other words, partially 
degraded information about immediate environmen- 
tal events is displayed at the higher centers; 
here the display is observed and a decision is 
made. 


The Theory of Testing Two Hypotheses 


In the fundamental detection problem, a func- 
tion of a fixed time interval is observed. The 
observer, in effect, is asked to decide whether 
the observation arose from noise alone or from 
signal-plus-noise, where the signal is known to 
be from a certain ensemble of signals. In terms 
of actual practice, the observer is asked to
state, after each presentation of the fixed time 
interval, which may or may not have contained a 
signal, either "yes, a signal was present" or "no, 
no signal was present." 


It is assumed that the space of all possible
observations contains families of nested sets, 
which are here called "criteria." For a given 
signal ensemble and type of noise only one of 
these families is employed. In a specific
situation, the "yes" or "no" response after a
particular observation is determined entirely by
whether or not the corrupted observation falls
inside a specific criterion. A less frequent
"yes" response is obtained by using a criterion
which is
a subset of the first one. Since the criteria of 
a given family are nested, they can be simply 
ordered, and therefore this ordering is viewed as 
contained in a linear continuum. Of course, a 
change of signal ensemble or noise type may 
change the family of criteria, and thus change 
the linear continuum. However, the assumption 
that the observations are in nested criteria is 
sufficient to allow the observations to be rep- 
resented along a continuous axis. 


As implied in the discussion of the pre- 
ceding paragraphs, the observations resulting 
from noise alone are regarded as randomly dis- 
tributed, as are the observations arising from a 
signal of a given strength plus noise. The mean 
of the signal-plus-noise distribution is assumed 
to be a monotonic increasing function of signal 
strength. If each observation may arise from 
noise alone, it is possible to distort the 
linear continuum, maintaining only order, so that
the distribution of observations (on the con- 
tinuum) is Gaussian with zero mean and unit 
variance. This must be qualified to allow gaps 
in the distorted continuum if, in the original, 
there were points of positive probability. 


This conception, when the signal-plus-noise 
(SN) distribution is simply a translation of the 
noise-alone (N) distribution, is depicted in 
Figure 1. The abscissa represents the observa- 
tion, x; the ordinate represents the probability 
that a given observation will result from noise 
alone and from signal-plus-noise. Although un- 
necessary, it may be helpful to conceive of the 
observation variable as some measure of neural 
activity. The noise assumption (neural activity 
independent of external stimulation) is consis- 
tent with present knowledge of neurophysiology. 


The Definition of Criterion and Likelihood
Ratio. According to the conceptual scheme of 
Figure 1, the observation variable is continuous, 
and any given observation may arise either from 
noise or signal-plus-noise. For this problem, 
the observer must establish a level of confidence, 
or a boundary of the criterion, and base his de- 
cision on the relation of the observation to this 
boundary. In terms of the theory, the observer 
chooses a set of observations (the criterion A) 
such that an observation in this set will lead 
him to Accept the hypothesis SN, that is, to say 
that a signal was present. All other observations
are in the Complement of the criterion,
CA; these are regarded by the observer as rep- 
resenting noise alone. The criterion A, with 
reference to Figure 1, consists of the values of 
x greater than some critical value. 


If the number of possible observations is
countable, the conditional probability that
signal-plus-noise will yield a given observation
is denoted $P_{SN}(x)$; the probability of this
observation, given noise alone, is denoted
$P_N(x)$. The ratio of these two probabilities
defines the likelihood ratio,
$\ell(x) = P_{SN}(x)/P_N(x)$. Here the observation
space is considered continuous, and, accordingly,
the probability density functions $f_{SN}(x)$ and
$f_N(x)$ are used; $f_{SN}(x)$ corresponds to the
curve labelled SN in Figure 1 and $f_N(x)$ is the
curve labelled N. In this case,
$\ell(x) = f_{SN}(x)/f_N(x)$. A decision concerning
signal existence is reasonably based on this
quantity, the relative likelihood that the
observation x arose from signal-plus-noise and
noise alone. According to the theory, for every x,
the observer can estimate $\ell(x)$.


A criterion is conventionally evaluated in
terms of the integrals of the density functions
over the criterion A, since the integral of
$f_{SN}(x)$ over A is the conditional probability
of detection, $P_{SN}(A)$, and the integral of
$f_N(x)$ over A is the conditional probability of
a false alarm, $P_N(A)$. A plot of $P_{SN}(A)$ vs.
$P_N(A)$ as the decision criterion, or criterion
boundary, varies is shown in Figure 2; this plot
is based on the assumption that the SN and N
distributions are Gaussian and of equal variance.
A family of curves is shown in this figure; the
parameter, d', is an index of signal strength. In
particular, d' is defined as the difference
between the means of the SN and N distributions
normalized to the standard deviation of the N
distribution, that is,

$$d' = \frac{M_{SN} - M_N}{\sigma_N}.$$

These curves, showing $P_{SN}(A)$ as a function of
$P_N(A)$, are called "detectability curves".
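
Points on a detectability curve can be computed directly from the definitions above. The following is a minimal sketch under the equal-variance Gaussian assumption stated for Figure 2, with the noise distribution normalized to zero mean and unit variance; the function name `detectability_curve` and the representation of a criterion A as "all x greater than a cutoff c" follow the text's description but are illustrative choices made here.

```python
from statistics import NormalDist


def detectability_curve(d_prime, cutoffs):
    """Return (P_N(A), P_SN(A)) pairs for criteria A = {x > c}, c in cutoffs.

    Assumes N ~ Normal(0, 1) and SN ~ Normal(d', 1).
    """
    noise = NormalDist(0.0, 1.0)
    signal = NormalDist(d_prime, 1.0)
    # Probability mass above the cutoff under each distribution.
    return [(1.0 - noise.cdf(c), 1.0 - signal.cdf(c)) for c in cutoffs]


if __name__ == "__main__":
    # Sweep the cutoff to trace one curve of the family for d' = 1.
    for p_fa, p_det in detectability_curve(1.0, [-1.0, 0.0, 0.5, 1.0, 2.0]):
        print(f"P_N(A) = {p_fa:.3f}   P_SN(A) = {p_det:.3f}")
```

Sweeping the cutoff from high to low moves the operating point from the lower left to the upper right of the curve; changing `d_prime` selects a different member of the family.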


Given a particular signal strength, the
observer can, according to the theory, operate at
any point on the associated detectability curve.
Since $P_{SN}(A)$ and $P_N(A)$ are dependent
probabilities (an increase in one of these
requiring, or resulting in, a determinate increase
in the other), a given operating level will be
more or less appropriate for a given set of
external conditions and a given purpose. The
theory of signal detectability (3) specifies the
optimum operating level, or criterion, under each
of several definitions of optimum. In each case,
the optimum criterion is in a family of criteria
defined in terms of likelihood ratio. A criterion
in this family is denoted $A(\beta)$; that is, the
criterion A contains all observations with
likelihood ratio greater than $\beta$, and none of
those with likelihood ratio less than $\beta$.
With respect to a given definition of optimum the
exact value of $\beta$ to be used constitutes its
solution. The slope of a detectability curve at
the point corresponding to the optimum operating
level is equal to this value of $\beta$.


Definitions of Optimum Criteria. Within the
theory of signal detectability, several
definitions of optimum are advanced, along with
their respective solutions. These may be listed
as follows:


1) The Weighted-Combination Criterion - the
criterion that maximizes $P_{SN}(A) - wP_N(A)$.
Solution: $A(\beta)$ where $\beta = w$.

2) The Ideal Criterion - the criterion that
minimizes total error. Solution: $A(\beta)$ where
$\beta = P(N)/P(SN)$, the ratio of a priori
probabilities.

3) The Expected-Value Criterion - the
criterion that maximizes the total expected value
where the individual values are:

$V_{SN \cdot A}$ = the value of a detection
$V_{N \cdot CA}$ = the value of a correct rejection
$K_{SN \cdot CA}$ = the cost of a miss
$K_{N \cdot A}$ = the cost of a false alarm

Solution: $A(\beta)$ where

$$\beta = \frac{P(N)}{P(SN)} \cdot \frac{V_{N \cdot CA} + K_{N \cdot A}}{V_{SN \cdot A} + K_{SN \cdot CA}}$$

4) The Neyman-Pearson Criterion - the
criterion that maximizes $P_{SN}(A)$ while
$P_N(A) \le k$. Solution: $A(\beta)$ where
$P_N[A(\beta)] = k$.

5) The Information Criterion - the criterion
that maximizes the reduction in uncertainty, in
the Shannon sense (7), as to whether or not a
signal was sent. Solution: $A(\beta)$ where

$$\beta = \frac{P(N)}{P(SN)} \cdot \frac{\log P_{CA(\beta)}(N) - \log P_{A(\beta)}(N)}{\log P_{A(\beta)}(SN) - \log P_{CA(\beta)}(SN)}$$

Conclusions Drawn from Previous Experiments


From the results of previous experiments,
it may be stated that the observation variable,
x, is continuous; the observer can distinguish
among values of x well into the noise
(9,10,11,12,13). Previous experiments have also
shown the observers to be capable of operating in
accordance with certain of the optimum criteria
listed. In particular, the observer can maximize
the expected value of the decision, and he can act
as a Neyman-Pearson Observer. It is likely that
the observer can act in accordance with
Weighted-Combination and Ideal Criteria, since the
Expected-Value Criterion and the Ideal Criterion
are special cases of the abstract
Weighted-Combination Criterion. The observer can
also adapt to another definition of optimum
advanced by the theory of signal detectability,
one leading to reporting a posteriori probability;
in this case no criterion is assumed, rather the
best estimate is made of the probability that the
observation arose from signal-plus-noise;
$P_x(SN) = \ell(x)P(SN) / [\ell(x)P(SN) + P(N)]$.
Certain conclusions drawn from previous
experiments are more conveniently discussed below.


The Theory of Testing Multiple Hypotheses


Given the demonstration that human observers
are capable of behavior very nearly optimum in
the two-hypothesis task, the present experiments
were designed to assess their ability to behave
optimally in more complex tasks.


In both of the experiments reported below,
the observers are required to decide among three
hypotheses. In one case, the observer chooses,
on each trial, on the basis of a single
observation interval, among the hypotheses: Signal
One occurred, Signal Two occurred, Noise Alone was
present. In the other case, three hypotheses
concerning the time of signal occurrence are
tested on each trial: the signal occurred in the
first observation interval, in the second
observation interval, in the third observation
interval.


In these three-hypothesis tests, as in the
two-hypothesis tests described above, a criterion
approach is required of the observer. He must
assume criteria such that each possible value of
the observation variable, x, leads to acceptance
of one of the given hypotheses. In both
experiments, the truth of each of the relevant
hypotheses is assigned an a priori probability,
and values are assigned to the possible outcomes
of the various decisions. This assignment permits
definition of the expected value of a decision,
and specifies as the appropriate optimum decision
rule the one that maximizes the expected value
of a decision.


In the experiments, the probabilities and 
values are varied from one group of trials to 
another. For each group of trials involving 
constant values and probabilities, the criteria 
assumed by the observer are compared with the 
criteria specified as optimum within the theory. 
For simplicity of analysis, in the present exper- 
iments, positive values were assigned to correct 
decisions and a value equal to zero was assigned 
to each incorrect decision. 


By definition, the expected value of a
decision is the product of probability and
desirability of an outcome summed over the
possible outcomes. The probability of an outcome
in this case is the probability of the joint event
involving the signal presented and the signal
hypothesis accepted, $P(i \cdot A_j)$. Since
$P(i \cdot A_j) = P(i)P_i(A_j)$, for the
three-hypothesis test the expected value is
defined by

$$EV = \sum_{i=1}^{3} P(i)\,V_i\,P_i(A_i) \tag{1}$$

where $P(i)$ is the a priori probability of
occurrence of the ith signal, $V_i$ is the value
of correctly identifying the occurrence of the ith
signal, and $P_i(A_i)$ is the conditional
probability of accepting the occurrence of the ith
signal when it occurred. In terms of the density
functions,

$$EV = \sum_{i=1}^{3} P(i)\,V_i \int_{A_i} f_i(x)\,dx \tag{2}$$

or

$$EV = \sum_{i=1}^{3} \int_{A_i} P(i)\,V_i\,f_i(x)\,dx. \tag{3}$$

Hence, the optimum decision performance requires
that the observation, x, be regarded as belonging
to the criterion $A_i$ (i.e., $x \in A_i$), if and
only if

$$P(i)\,V_i\,f_i(x) > P(j)\,V_j\,f_j(x) \quad \text{for all } j \neq i. \tag{4}$$
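
The rule in Equation 4 amounts to choosing the hypothesis with the largest weighted density at the observed x. A minimal sketch follows; the function name `optimum_decision` and the way densities are passed in as callables are illustrative assumptions, and ties are resolved in favor of the lowest index, a detail the text does not address.

```python
from statistics import NormalDist


def optimum_decision(x, priors, values, densities):
    """Return the index i maximizing P(i) * V_i * f_i(x) (Equation 4)."""
    scores = [p * v * f(x) for p, v, f in zip(priors, values, densities)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical one-dimensional illustration with two unit-variance
    # Gaussian hypotheses centered at 0 and 2.
    densities = [NormalDist(0.0, 1.0).pdf, NormalDist(2.0, 1.0).pdf]
    print(optimum_decision(0.9, [0.5, 0.5], [1.0, 1.0], densities))
    print(optimum_decision(0.9, [0.5, 0.5], [1.0, 3.0], densities))
```

Raising the value attached to one hypothesis enlarges the set of observations assigned to it, which is how changes in the payoff matrix move the optimum criteria.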


The development immediately above applies 
to both experiments. In the first section be- 
low, the theory is developed for the case of one 
of two signals or noise alone, (hereafter refer- 
red to as the "detection-and-recognition task") 
and in the second section below, for the case 
of one signal in one of three observation inter- 
vals (to be referred to as the "forced-choice- 
in-time task"). 


The Specification of the Optimum Decision
Criteria for the Detection-and-Recognition Task.


This section presents the method of deriving the 
specification of optimum performance to which 
the observer's performance is compared in the 
detection-and-recognition task. 


Let $f_1(x) = f_{S_1N}(x)$, $f_2(x) = f_{S_2N}(x)$,
and $f_3(x) = f_N(x)$, and call $A_3 = B$. Then

$$x \in B \iff P(N)V_N f_N(x) > P(S_j)V_{S_j} f_{S_jN}(x), \quad j = 1, 2, \tag{5}$$

that is, x is in B if and only if

$$\ell_j(x) = \frac{f_{S_jN}(x)}{f_N(x)} < \frac{P(N)V_N}{P(S_j)V_{S_j}}, \quad j = 1, 2. \tag{6}$$

Similarly,

$$x \in A_2 \iff P(S_2)V_{S_2} f_{S_2N}(x) > P(N)V_N f_N(x) \tag{7}$$

$$\text{and} \quad P(S_2)V_{S_2} f_{S_2N}(x) > P(S_1)V_{S_1} f_{S_1N}(x), \tag{8}$$

that is,

$$x \in A_2 \iff \ell_2(x) > \frac{P(N)V_N}{P(S_2)V_{S_2}} \tag{9}$$

$$\text{and} \quad \ell_2(x) > \frac{P(S_1)V_{S_1}}{P(S_2)V_{S_2}}\,\ell_1(x), \tag{10}$$

and

$$x \in A_1 \iff \ell_1(x) > \frac{P(N)V_N}{P(S_1)V_{S_1}} \tag{11}$$

$$\text{and} \quad \ell_1(x) > \frac{P(S_2)V_{S_2}}{P(S_1)V_{S_1}}\,\ell_2(x). \tag{12}$$

In the detection-and-recognition experiment
reported below, $S_1$, $S_2$, and N were presented
with equal probabilities, that is,
$P(S_1N) = P(S_2N) = P(N) = \frac{1}{3}$; the
optimum criteria varied as a function of changes
in the value or payoff matrix. Thus, the optimum
decision criteria may be specified as follows:

$$x \notin B \iff \ell_j(x) > \frac{V_N}{V_{S_j}}, \quad \text{for either } j, \tag{13}$$

and

$$x \in A_1 \iff x \notin B \ \text{and} \ \ell_1(x) > \frac{V_{S_2}}{V_{S_1}}\,\ell_2(x). \tag{14}$$

Equations 13 and 14 permit the assignment
of optimum frequencies of response in a 3 x 3
table where the columns represent the true
hypothesis and the rows represent the hypothesis
accepted, providing the relevant detectability
indices and values are known. Just as a decision
between two hypotheses can always be represented
on a single axis with a normalized Gaussian
distribution under noise alone, the decision among
three hypotheses can be represented on a plane
(two axes) with a normalized Gaussian distribution
under noise alone. Consider the model represented
in Figure 3. If the signal-plus-noise
distributions are merely translations of the noise
distribution on the two axes, then the three
hypotheses can be represented as three points on a
plane in which each distribution is Gaussian in
every direction with unit variance. The criterion
boundaries are straight since the method of
assigning values to the various outcomes of a
decision leaves only a single density function in
the integral to be maximized (in Equation 4), not
linear combinations of density functions. The
prediction of the response table then depends
upon the distance of the boundaries from the
points. Said another way, $\ell_1(x)$ is constant
on any line perpendicular to the line $NS_1$, and
$\ell_2(x)$ is constant on any line perpendicular
to the line $NS_2$. The ratio of likelihood
ratios, $\ell_1(x)$ and $\ell_2(x)$, is also
constant along any line perpendicular to the line
$S_1S_2$. Because the equalities of (13) and (14)
hold on optimum boundaries, the three boundaries
meet at a point. A knowledge of the detectability
index d', for each pair of hypotheses, that is, of
the lengths of the sides of the triangle, and a
knowledge of $V_{S_1}$ and $V_{S_2}$, permits the
specification of the optimum response table.*


The method actually used to determine the 
optimum response probabilities entailed drawing 
the boundaries on double probability paper ruled 
in 1% steps on each axis. Each rectangle repre- 
sents a probability of .01% for a two-dimensional 
Gaussian distribution centered at the center of 
the paper. Three graphs are drawn for each ob- 
server and each set of values, each graph assum- 
ing the truth of one of the three hypotheses. 
The rectangles within the various boundaries on 
each graph are counted; thus each graph yields 
a column of numbers in a 3 x 3 table. This pro- 
cess provides an approximation good to better 
than 1%. This procedure is described in more 
detail in the appendix. 
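The rectangle-counting procedure just described can be imitated numerically. The following is only a sketch under stated assumptions: equal a priori probabilities, acceptance of the hypothesis with the largest V_i f_i(x), unit-variance Gaussian distributions that are translations of one another, and a 100 x 100 grid of equal-probability cells standing in for the double probability paper. The function names and the ordering (N, S_1, S_2) are illustrative choices, not part of the original method.

```python
import math
from statistics import NormalDist


def hypothesis_points(d1, d2, d12):
    """Place N at the origin, S1 on the first axis, and S2 by the law of cosines."""
    cos_angle = (d1 ** 2 + d2 ** 2 - d12 ** 2) / (2.0 * d1 * d2)
    angle = math.acos(cos_angle)
    return [(0.0, 0.0), (d1, 0.0), (d2 * math.cos(angle), d2 * math.sin(angle))]


def accepted(x, y, points, values):
    """Index of the hypothesis with the largest V_i * f_i(x), equal priors assumed."""
    scores = [math.log(v) - 0.5 * ((x - px) ** 2 + (y - py) ** 2)
              for (px, py), v in zip(points, values)]
    return scores.index(max(scores))


def response_column(true_index, points, values, cells=100):
    """Count equal-probability cells centred on the true hypothesis.

    Each cell carries probability 1 / cells**2 (.01% for cells = 100), so the
    counts give one column of the 3 x 3 conditional response table.
    """
    cx, cy = points[true_index]
    offsets = [NormalDist().inv_cdf((i + 0.5) / cells) for i in range(cells)]
    counts = [0, 0, 0]
    for dx in offsets:
        for dy in offsets:
            counts[accepted(cx + dx, cy + dy, points, values)] += 1
    return [c / cells ** 2 for c in counts]


if __name__ == "__main__":
    # Detectability indices quoted in the text for Observer 1; the values
    # (V_N, V_S1, V_S2) below are illustrative only.
    points = hypothesis_points(1.77, 1.45, 1.84)
    values = (1.0, 2.0, 3.0)
    for j, name in enumerate(("N", "S1", "S2")):
        print(name, response_column(j, points, values))
```

Each call to `response_column` plays the role of one graph: the grid is centred on the hypothesis assumed true, and the cells falling inside each acceptance region are counted.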


The Specification of the Optimum Decision Criteria for the Forced-Choice-in-Time Task. As indicated above, the optimum decision function for the case where multiple hypotheses exist is acceptance of the ith signal when and only when

$$P(i)\,V_i\,f_i(x) > P(j)\,V_j\,f_j(x) \quad \text{for all } j \neq i. \qquad (4)$$

In the routine use of the forced-choice-in-time technique, P(i) = P(j) and V_i = V_j for all i and j; in this case only the density function f_i(x) need be considered. In this case, optimum behavior requires that the observation interval yielding the greatest value of f_i(x), or the greatest value of ℓ(x) = f_SN(x)/f_N(x), be selected as containing the signal. If the signal distribution is a translation of the noise distribution, the probability of a correct response given selection of the largest value of ℓ(x), or the probability that the largest value of f_i(x) represents signal-plus-noise, is the probability that one drawing from a normal distribution with mean d' and unit


* A previous study reported by Tanner (20) made use of this model of a signal plane where the signal distributions are simple translations of the distribution of noise alone. In that study, the hypotheses were treated pairwise, and the "angle" between the signals,

$$\arccos \frac{(d'_1)^2 + (d'_2)^2 - (d'_{12})^2}{2\,d'_1\,d'_2},$$

was studied as a function of frequency separation and duration. The choice of signals (900 cps, 1000 cps, 0.1 sec. duration) used in the detection-and-recognition experiment reported below was


variance is greater than the greater of two drawings from a normal distribution with zero mean and unit variance, where d', it will be recalled, is the difference between the means of the signal-plus-noise and noise distributions normalized to the standard deviation of the noise distribution, $d' = (\mu_{SN} - \mu_N)/\sigma_N$. The probability that a correct answer will result for a given value of d', for the test involving three observation intervals, is given by the equation:

$$P(c) = \int_{-\infty}^{+\infty} \Phi^2(x)\,\phi(x - d')\,dx \qquad (15)$$

where Φ(x) is the area of the noise distribution to the left of x and φ(x − d') is the ordinate of the signal-plus-noise distribution. P(c) vs. d' for the three-hypothesis task is plotted in Figure 4.
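The integral for P(c) has no elementary closed form, but it is easily evaluated numerically. The sketch below assumes the equal-variance Gaussian model of the text and uses a plain trapezoidal rule; the function names, integration limits and step count are arbitrary choices made for illustration.

```python
import math


def normal_pdf(x):
    """Ordinate of the unit normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Area of the unit normal distribution to the left of x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_correct(d_prime, n_intervals=3, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal estimate of P(c): one signal interval among n_intervals.

    For n_intervals = 3 the integrand is cdf(x)**2 * pdf(x - d'), as in Eq. 15.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * normal_cdf(x) ** (n_intervals - 1) * normal_pdf(x - d_prime)
    return total * h


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(f"d' = {d:.1f}  P(c) = {p_correct(d):.3f}")
```

With d' = 0 the function returns the chance level of one-third, which is a convenient check on the integration.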


Forced-choice-in-time tests such as the one just described, that is, where P(i) = P(j) and V_i = V_j for all i and j, have been performed prior to the present experiment. The congruence of the estimates of d' from such tests and from several two-hypothesis tests involving variable values and probabilities, in which optimum behavior was demonstrated, indicates that the observer does operate with the optimum decision function in the forced-choice-in-time test (2). Data from other experiments requiring second choices (8,11), and, in another instance, requiring last choices (13) from the observers in the four-alternative, forced-choice-in-time task, demonstrate that the observation variable is continuous; observers are capable of ordering values of this variable well into the noise. Until the present experiment, however, the first involving unequal values and probabilities, no direct evidence concerning the ability of the human observer to establish optimum criteria in the forced-choice-in-time task was available.


Consider the case represented by the experiment reported below. In this experiment, two conditions were employed. Throughout the first condition V_1 = V_2 = V_3; the probabilities were varied from one group of observations to another under the restriction that P(1) = P(2) ≠ P(3). Throughout the second condition P(1) = P(2) = P(3); here the values were varied under the restriction that V_1 = V_2 ≠ V_3. In developing the method of specifying the optimum behavior in forced-choice-in-time tasks, these two conditions may be considered together.


The effects of a priori probabilities and values on the solution for the expected-value definition of optimum in this task are combined in a single parameter, w, where

$$w = \frac{P(3)\,V_3}{P(1)\,V_1 + P(2)\,V_2} \qquad (16)$$


based on a desire to have a separation large 
enough to yield a signal angle of almost 90°. 




Note that w is defined similarly to the critical value of likelihood ratio, β, for the two-hypothesis task as discussed above.
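As a small arithmetic check on the parameter w, the sketch below assumes that Equation 16 reads w = P(3)V_3 / [P(1)V_1 + P(2)V_2]. This reading is an assumption, adopted because it reproduces the values of w (2.00 and .125) quoted for the forced-choice-in-time experiment; the function name is illustrative.

```python
def weight_parameter(prior, value):
    """w = P(3) V3 / (P(1) V1 + P(2) V2), the reading of Eq. 16 assumed here."""
    return prior[2] * value[2] / (prior[0] * value[0] + prior[1] * value[1])


if __name__ == "__main__":
    equal = (1 / 3, 1 / 3, 1 / 3)
    # Equal probabilities and values give w = .50.
    print(weight_parameter(equal, (1, 1, 1)))
    # Values varied: V3 = 4 against V1 = V2 = 1, and the reverse.
    print(weight_parameter(equal, (1, 1, 4)))
    print(weight_parameter(equal, (4, 4, 1)))
    # Probabilities varied: P3 = 1/9 (about .11) and P3 = 2/3.
    print(weight_parameter((4 / 9, 4 / 9, 1 / 9), (1, 1, 1)))
    print(weight_parameter((1 / 6, 1 / 6, 2 / 3), (1, 1, 1)))
```

Under this reading, w exceeds .50 exactly when the third alternative is more probable or more valuable than either of the other two.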


Recalling that the optimum decision function is to select the greatest value of P(i)V_i f_i(x), we may regard equivalently

$$f_1(x),\quad f_2(x),\quad 2w\,f_3(x)$$

or

$$\ell_1(x),\quad \ell_2(x),\quad 2w\,\ell_3(x)$$

or

$$\log_e \ell_1(x),\quad \log_e \ell_2(x),\quad \log_e \ell_3(x) + \log_e 2w$$

as representing the observations associated with the three observation intervals. Now, from Ref. 3, Part II, Sections 4.7 and 4.9, if the logarithm of likelihood ratio is normally distributed in noise alone, then the mean is −(d')²/2 and the variance is (d')²; in signal-plus-noise the mean is +(d')²/2 and the variance is (d')². Using the notation (m, σ) to indicate a normal distribution with mean m and standard deviation σ,

$$\log_e \ell \text{ is in N: } \left(-\frac{(d')^2}{2},\, d'\right) \text{ and in SN: } \left(+\frac{(d')^2}{2},\, d'\right)$$

and therefore,

$$\frac{\log_e \ell + \frac{(d')^2}{2}}{d'} \text{ is in N: } (0, 1) \text{ and in SN: } \left(\frac{(d')^2}{d'},\, 1\right) = (d', 1). \qquad (18)$$


Therefore, a convenient form for the decision components is the following:

$$d_i(x) = \frac{1}{d'}\left[\log_e \ell_i(x) + \frac{(d')^2}{2}\right], \quad i = 1, 2 \qquad (19)$$

$$d_3(x) = \frac{1}{d'}\left[\log_e \ell_3(x) + \frac{(d')^2}{2} + \log_e 2w\right] \qquad (20)$$

and if we introduce the parameter

$$k = \frac{\log_e 2w}{d'} \qquad (21)$$

then

$$d_3(x) = \frac{1}{d'}\left[\log_e \ell_3(x) + \frac{(d')^2}{2}\right] + k. \qquad (22)$$

That is, three decision components may be considered: d_1(x) and d_2(x) are (0,1) in N and (d',1) in SN, d_3(x) is (k,1) in N and (k+d',1) in SN. Optimum performance requires choosing the interval with the largest associated d_i(x).
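The decision rule stated above can be written out directly. The sketch below assumes that the shift applied to the third component is k = log_e(2w)/d', so that k = 0 when w = .50; the function names and the sample log likelihood ratios are illustrative only.

```python
import math


def criterion_shift(w, d_prime):
    """k = ln(2w) / d', the shift added to the third decision component."""
    return math.log(2.0 * w) / d_prime


def decision_components(log_lr, w, d_prime):
    """Standardize the three log likelihood ratios and shift the third by k."""
    components = [(value + 0.5 * d_prime ** 2) / d_prime for value in log_lr]
    components[2] += criterion_shift(w, d_prime)
    return components


def choose_interval(log_lr, w, d_prime):
    """Return the interval (1, 2 or 3) with the largest decision component."""
    components = decision_components(log_lr, w, d_prime)
    return components.index(max(components)) + 1


if __name__ == "__main__":
    observation = [0.2, 0.1, 0.0]  # hypothetical log likelihood ratios
    print(choose_interval(observation, w=0.5, d_prime=1.5))  # no bias
    print(choose_interval(observation, w=2.0, d_prime=1.5))  # bias toward interval 3
```

The same observation can therefore lead to different choices as w changes, which is the sense in which probabilities and values move the criteria.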


Now the probability that d_3(x) is greater than d_1(x) and d_2(x) when the signal is presented in the third interval is the probability that one drawing from the normal distribution (d'+k,1) is greater than two drawings from the normal distribution (0,1). This probability may be determined from Figure 4, since this probability is the same as the probability of a correct response in a three-hypothesis test with equal values and equal probabilities if d' were k larger. The probability that d_2(x) is greater than d_1(x) and d_3(x) when the signal occurs in interval two is the probability that one drawing from the normal distribution (d',1) is greater than one drawing from the distribution (k,1) and one drawing from the distribution (0,1). By symmetry, this is also the probability that d_1(x) is greater than d_2(x) and d_3(x) when the signal occurs in interval one. These three probabilities, for a given d' and a given w, may be determined from the plot of Figure 5. In this figure the ordinate is P_1(A_1) = P_2(A_2), the abscissa is 1 − P_3(A_3) = P_3(A_1 ∪ A_2), and d' and w are parameters. As in the case of the two-hypothesis tests, the optimum operating level is that point on a given detectability curve where its slope is w.

The probability that d_1(x) is greater than d_2(x) when the signal occurs in the second interval (or, by symmetry, the probability that d_2(x) is greater than d_1(x) when the signal occurs in the first interval) may also be computed; this is the probability that one drawing from the normal distribution (0,1) is greater than one from the distribution (d',1) and one from the distribution (k,1). In Figure 6, P_2(A_1) = P_1(A_2) is plotted against P_1(A_1) = P_2(A_2) with d' and w as parameters.


The development just given permits the assignment, for any d' and any w, of optimum frequencies of response in a 3 x 3 table where the columns represent the true hypothesis and the rows represent the hypothesis accepted. Thus, as w, or P(3) relative to P(1) and P(2), or V_3 relative to V_1 = V_2, is varied, it is possible to compare the observer's actual performance and the optimum performance for an observer characterized by his d'.**
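The 3 x 3 table of optimum conditional probabilities can also be computed without the graphs. The sketch below assumes three independent, unit-variance normal decision components with means 0, 0 and k in noise, each raised by d' in the interval containing the signal, and k = log_e(2w)/d'. The probability that one component is the largest is obtained by a trapezoidal integration; names and numerical settings are illustrative.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_largest(means, winner, lo=-12.0, hi=12.0, steps=4800):
    """P(component `winner` exceeds all others), independent unit-variance normals."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        term = normal_pdf(x - means[winner])
        for j, mean in enumerate(means):
            if j != winner:
                term *= normal_cdf(x - mean)
        total += (0.5 if i in (0, steps) else 1.0) * term
    return total * h


def optimum_table(d_prime, w):
    """table[accepted][presented]: optimum conditional response probabilities."""
    k = math.log(2.0 * w) / d_prime
    table = [[0.0] * 3 for _ in range(3)]
    for presented in range(3):
        means = [0.0, 0.0, k]
        means[presented] += d_prime
        for accepted in range(3):
            table[accepted][presented] = prob_largest(means, accepted)
    return table


if __name__ == "__main__":
    for row in optimum_table(d_prime=1.5, w=2.0):
        print("  ".join(f"{p:.3f}" for p in row))
```

Multiplying each column by the number of presentations of that alternative converts the probabilities into the response frequencies to be compared with the data. Every cell is positive for any finite d' and w.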


The actual computation of optimum response frequencies is conveniently based on a geometrical model similar to the one described above for the detection-and-recognition task. Consider a three-dimensional space, with the three hypotheses represented by points along the axes at distance d'. This space is depicted in Figure 7. Since noise alone is never presented in the forced-choice-in-time task, the present interest lies in the sloping plane of the triangle drawn with dotted lines; this is the plane of Fig. 8. As in the case of the detection-and-recognition task, boundaries are drawn on double probability paper, and the rectangles are counted. (See the appendix.)


A Contrasting Prediction of Decision-Making Theory 
and Threshold Theory for Three-Hypothesis Tests. 


According to conventional sensory theory, there exists a lower bound on the boundary of a criterion (a "threshold") at roughly 3 sigma up from the mean of the noise distribution, that is, at a point on the observation continuum that is rarely exceeded by noise alone. Below this point, distinction among different values of the observation variable is assumed to be impossible.

Unlike some earlier papers in this series (8,11,22), whose primary objectives were a description of decision-making theory and threshold theory, the derivation of contrasting predictions from two theories, and the presentation of data relevant to these predictions, the present paper is based on an assumed applicability of the decision-making theory, and is concerned primarily with extending this theory. It seems worthwhile to point out, however, the general nature of the divergence between the two theories with respect to the three-hypothesis tasks considered here. This is conveniently accomplished by presenting a single illustrative example.


Consider the forced-choice-in-time task where one of the three signal hypotheses is a priori more probable or more valuable than the other two, that is, where the signal appears in one of the three intervals with probability greater than one-third and with equal probabilities in the other two intervals, or where a correct choice when the signal occurs in the special interval pays more than a correct choice


** A similar approach to specifying optimum behavior in the detection-and-recognition task, that is, in terms of drawing from normal distributions, might have been taken in the previous section. However, unless the two signals are equally detectable and orthogonal, unless the decision components are distributed independently of one another, this mode of analysis involves a correlation term; in this case the analysis is too lengthy to be reported here. It may be seen in the appendix that the requirements of equally detectable and orthogonal signals were not always satisfied.


when the signal occurs in one of the other intervals. Assume, as in the case of the task described above, that the special signal hypothesis, or interval, is the third one. This task falls in the class of tasks characterized by a k > 0, or a w > .50.

According to the threshold theory, if the observation made in any interval is above threshold, this almost certainly indicates that the signal was in this interval. If this happens p_0 of the time for a given strength of signal, then the diagonal entries in the conditional probability table will be at least p_0. If no interval yields an observation above threshold, then the observer randomly selects one of the alternatives. The conditional table, in this case, contains two degrees of freedom.

                               Presented
                   1                   2                   3
Accepted   1   p_0 + a_1(1-p_0)    a_1(1-p_0)          a_1(1-p_0)
           2   a_2(1-p_0)          p_0 + a_2(1-p_0)    a_2(1-p_0)
           3   a_3(1-p_0)          a_3(1-p_0)          p_0 + a_3(1-p_0)

In the table, the a_i are non-negative and add to one.

If the observer has control of the guessing probabilities a_i, he should concentrate the guessing on the most profitable alternative. The formula for the expected value becomes

$$EV = p_0 \sum_{i=1}^{3} P(i)\,V_i + (1 - p_0) \sum_{i=1}^{3} P(i)\,V_i\,a_i. \qquad (23)$$

In the case in question where k > 0 (w > .5), he would do best to set a_3 = 1; then the expected table contains four zeros. For the situation when k < 0 (w < .5), it would be profitable to set a_3 = 0; then the expected table contains two zeros. By contrast, the decision-making theory leads to a prediction of positive entries in each cell of the expected table.
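The counts of zeros predicted by the threshold theory can be verified mechanically. The sketch below builds the conditional table with entry p_0·[i = j] + a_i(1 − p_0) for accepted alternative i and presented alternative j, and evaluates the expected value as the sum of P(i)V_i times the diagonal entries. The value p_0 = .4 and the guessing vectors are hypothetical, chosen only for illustration.

```python
def threshold_table(p0, guess):
    """table[accepted][presented] under the threshold model with guessing."""
    return [[(p0 if i == j else 0.0) + guess[i] * (1.0 - p0) for j in range(3)]
            for i in range(3)]


def expected_value(p0, guess, prior, value):
    """EV = p0 * sum P(i)V_i + (1 - p0) * sum P(i)V_i a_i."""
    detected = sum(p * v for p, v in zip(prior, value))
    guessed = sum(p * v * a for p, v, a in zip(prior, value, guess))
    return p0 * detected + (1.0 - p0) * guessed


def count_zeros(table):
    return sum(1 for row in table for cell in row if cell == 0.0)


if __name__ == "__main__":
    prior = (1 / 3, 1 / 3, 1 / 3)
    value = (1.0, 1.0, 4.0)              # third alternative most valuable (w > .5)
    all_on_three = (0.0, 0.0, 1.0)       # a_3 = 1
    none_on_three = (0.5, 0.5, 0.0)      # a_3 = 0
    print(count_zeros(threshold_table(0.4, all_on_three)))    # four zeros
    print(count_zeros(threshold_table(0.4, none_on_three)))   # two zeros
    print(expected_value(0.4, all_on_three, prior, value))
    print(expected_value(0.4, none_on_three, prior, value))
```

With the third alternative the most valuable, putting all guesses on it gives the larger expected value, at the cost of four empty cells in the table.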


The Experiments 


In the first part of this section, the apparatus used in the experiments is briefly described. Following this, the two experiments are described and their results presented.


Apparatus 


The experiments are essentially conducted by an apparatus called N. P. Psytar. The name of the apparatus is a condensation of the phrase Noise Programmed Psychophysical Tester and Recorder; the details of its construction have been reported elsewhere. Prior to an experimental session, the randomization constants characterizing the experiment, and the physical constants characterizing the signals and noise employed, are effected by knob settings on the machine's face. The machine then programs the actual experiment, that is, determines whether or not a particular signal is presented in a particular observation interval, by sampling the output of a random noise generator; if the voltage of the sample taken during the observation interval exceeds the preset level, a thyratron is triggered and a particular signal is presented.

The observer is informed of the progress 
of the experiment by a set of three lights. The 
first light is a warning light; when it flashes 
a trial cycle begins. The second light flashes 
in coincidence with the observation interval, or 
intervals, the number of intervals depending upon 
the nature of the experiment. The third light 
indicates that the observer should make a select- 
ion by pushing one of the answer buttons available 
to him. After his selection is recorded by the 
apparatus, he is informed of its correctness by a 
second panel of lights which indicates the select- 
ion made by the machine. 


The signal sources used in the experiments were Hewlett-Packard audio-oscillators 200 AB and 200 I. The masking-noise source was a General Radio 1390A random noise generator, providing a white, Gaussian noise up to 20 kc/s. In the first experiment, the two signals were a 900 cps tone and 1000 cps tone, each of 0.1 second duration. The signal used in the second experiment was a 1000 cps tone of 0.1 second duration.

The signal from the audio-oscillator is fed into a gate circuit; the output of the gate circuit is fed into an adder. The gated signal contains an integral number of cycles, beginning at zero voltage. This is fed to a Williamson amplifier, and then to the observers' PDR-8 earphones. The noise is fed from the noise generator to the adder, to the Williamson amplifier, and then to the earphones. All measurements are made at the input to the earphones, using a Hewlett-Packard 400 B average reading voltmeter calibrated in RMS voltage. Throughout the experiments reported below, the signal strength and noise level were constant; 2E/N_0 (twice the signal energy divided by the noise power per unit bandwidth) was equal to 13.5 at the earphone input.


The Detection-and-Recognition Experiment


Procedure. In the detection-and-recognition experiment, five groups of observations, each group containing 400 observations, were obtained.


The five groups were distinguished by the values 


assigned to the observer's correct acceptance of 
Signal 1 (1000 cps), Signal 2 (900 cps), or 

Noise alone, respectively. These value clusters 
in the order presented experimentally, were .6, 
Aopen ee ee ee ee op RE ile ery Gn Fis 
Throughout this experiment, the probability of 
occurrence of each of the three alternatives was 
fixed at one-third. 


The values listed in the preceding paragraph are listed there in the form in which they were presented to the observers. This form was chosen to make clear the proportionate value of correctly accepting each of the alternatives. As a basis for making actual payoffs to the observers, the various values were equated with fractions of a cent such that the payoff approximated 50 cents per hour throughout the different value conditions. The values employed were selected to yield an adequate sampling of the range within which none of the expected response frequencies drops below a minimum value required by certain statistical tests performed.


Two college students, one a music major and the other an English major, served as observers in this experiment and also in the second experiment reported below. They observed two hours a day for five days a week; they received a regular hourly rate for their services, as well as the payoff based on their performance. The observers were rather thoroughly experienced before
the experiments reported here were begun. Prior 
to these experiments, they had taken part in an 
extensive investigation of the influence of the 
amount of prior information on the approach to 
the optimum decision criteria in a Sea a 
test, which involved over 10,000 observations 8), 


The Data. The raw data, in terms of the 
response frequencies obtained, are presented in 
the 3 x 3 tables of Table I. The numbers within 
parentheses in each cell are the optimum response 
frequencies, the frequencies predicted from the 
theory. 


[Table I: for each of the five value conditions, a pair of 3 x 3 tables (Observer 1 and Observer 2) with columns "Presented" and rows "Accepted", giving obtained response frequencies with the optimum frequencies in parentheses. The cell entries are not recoverable from the source. Legible value assignments, in order: S_1N = .6; S_1N = 1; S_1N = 2; S_1N = 1.5, S_2N = 2; S_1N = .8.]

TABLE I. The Raw Data from the Detection-and-Recognition Experiment


146 


The Correlation Between Predicted and Obtained Response Tables. As discussed above, the optimum criteria are specified by Equations 13 and 14. These equations, along with the value of correctly accepting each alternative, and the detectability index (d') characterizing each pair of alternatives, determine the optimum response frequencies. The values of d' for each observer, for each pair of alternatives, are obtained in separate experiments. Here, the detectability indices, d'_1 and d'_2, for S_1N and S_2N respectively, are obtained individually in three-alternative, forced-choice experiments. The index d'_12 is obtained in an experiment where either S_1N or S_2N is presented in the single interval of a trial cycle. For Observer 1, d'_1 = 1.77, d'_2 = 1.45 and d'_12 = 1.84; for Observer 2, d'_1 = [illegible], d'_2 = 1.45 and d'_12 = 2.04. The values of d'_1 and d'_2 listed are based on 400 observations and the value of d'_12 on 600 observations.


Table II gives the coefficients of rank-order correlation (Spearman's rho) between the individual predicted and obtained response tables of Table I. For samples of this size, a coefficient of .62 has an associated probability of .05 and a coefficient of .79 has an associated probability of .01, under the null hypothesis. Note in Table II that nine of the ten coefficients have an associated probability of less than .01, and the tenth has an associated probability between .05 and .01.


V_S1N    V_S2N    V_N     Observer 1    Observer 2
 .6       ?        ?         .88           .92
1         ?        ?         .81           .98
2         3        1        1.00           .95
1.5       2        1         .98          1.00
 .8       .6       ?         .67          1.00

(? = entry illegible in the source)

TABLE II. The Correlation between Predicted and Obtained Response Frequencies in the Detection-and-Recognition Experiment
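For readers who wish to reproduce a rank-order correlation of this kind, the sketch below computes Spearman's rho over the nine cells of a predicted and an obtained 3 x 3 table. Tied values receive the average of their ranks. The two frequency lists are hypothetical and are not the data of Table I.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Product-moment correlation between the ranks of x and the ranks of y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical cell frequencies, read row by row from two 3 x 3 tables.
    predicted = [95, 14, 20, 12, 88, 25, 26, 31, 89]
    obtained = [99, 10, 24, 15, 80, 21, 19, 43, 89]
    print(round(spearman_rho(predicted, obtained), 2))
```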


The Comparison of Predicted and Obtained Payoffs. It is instructive to compare the payoff obtained by the observer with certain theoretical amounts of payoff. One of these, the one of primary interest, is the maximum payoff attainable, given the observer's sensitivity characteristics (d'_1, d'_2 and d'_12); a second is the payoff attainable by an infinitely sensitive observer; still a third is the payoff attainable without sensitivity, or with the earphones removed, by simply choosing consistently the most valuable alternative.
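The three reference payoffs can be expressed in a few lines. The sketch below rests on two assumptions that the text does not state explicitly: only correct acceptances are paid, and the alternatives occur exactly in their nominal proportions. The function names and the sample table are hypothetical.

```python
def expected_payoff(table, prior, value):
    """Sum over alternatives of P(j) * V_j * P(accept j | j presented).

    table[accepted][presented] holds conditional response probabilities.
    """
    return sum(prior[j] * value[j] * table[j][j] for j in range(3))


def normalized_payoffs(table, prior, value):
    """Payoffs divided by the payoff of an infinitely sensitive observer."""
    perfect = sum(p * v for p, v in zip(prior, value))
    # Without sensitivity, the best course is always to accept the alternative
    # with the largest P(j) * V_j.
    blind = max(p * v for p, v in zip(prior, value))
    return {
        "infinite_sensitivity": 1.0,
        "observer": expected_payoff(table, prior, value) / perfect,
        "zero_sensitivity": blind / perfect,
    }


if __name__ == "__main__":
    prior = (1 / 3, 1 / 3, 1 / 3)
    value = (2.0, 3.0, 1.0)
    table = [[0.70, 0.10, 0.20],   # hypothetical conditional probabilities
             [0.20, 0.80, 0.30],
             [0.10, 0.10, 0.50]]
    print(normalized_payoffs(table, prior, value))
```

Because the actual presentations were programmed at random, realized frequencies would differ somewhat from the nominal one-third, so figures computed this way need not match the tabled values exactly.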


Table III presents these various payoffs, for each observer, for each value condition. They are normalized such that the payoff attainable with infinite sensitivity is equal to unity.

                               Value Condition
                          1       2       3       4       5
Observer 1   d' = ∞      1.00    1.00    1.00    1.00    1.00
             predicted    .741    .688    .707    .665    .697
             obtained     .728    .693    .727    .658    .658
             d' = 0       .561    .352    .529    .458    .412

Observer 2   d' = ∞      1.00    1.00    1.00    1.00    1.00
             predicted    .704    .651    .743    .692    .640
             obtained      ?      .633    .662    .706    .726
             d' = 0       .561    .352    .529    .458    .411

(? = entry illegible in the source)

TABLE III. A Comparison of Various Payoffs in the Detection-and-Recognition Experiment


An overall impression of these data is more easily obtained from their graphical presentation in Figure 9. It may be observed from this figure that the largest discrepancy between prediction and data is approximately 7% for Observer 1 and 18% for Observer 2.

That, in some instances, the payoff obtained is greater than the payoff predicted, is explained by the fact that the estimates of the observer's sensitivity characteristics (which contribute to the determination of the value predicted) are based on separate experiments, performed on different days, and are not extracted from the data from which the value obtained is computed.

An index of the congruence of the predicted and obtained payoffs may be had by taking the ratio of the two. The ratio of obtained to predicted payoff, for each condition, is presented in Table IV.

                     Value Condition
                1       2       3       4       5
Observer 1      ?       ?       ?       ?       ?
Observer 2      ?      .97     .89    1.02    1.14

(? = entry illegible in the source)

TABLE IV. The Ratio of Obtained to Predicted Payoff in the Detection-and-Recognition Experiment

Actually, the observers may not be performing as near optimum as the data presented in Figure 9 and Table IV make it appear. It is quite possible that the payoff scheme chosen is one which is relatively insensitive to deviations from the optimum criteria, and, hence, that the ratio of obtained to predicted payoff is a somewhat misleading index of the congruence between the optimum criteria and the criteria assumed by the observers. It may be, in other words, that the payoff scheme employed, inadvertently, yields an almost constant payoff over a relatively wide range of criteria in the critical region. Whether or not this is the case is difficult to ascertain directly, since there are two degrees of freedom in the criteria corresponding to a given value cluster, and hence the expected payoff as a function of change in criteria cannot be represented in two dimensions. This difficulty is overcome in the forced-choice-in-time experiment reported below, since in the latter experiment a symmetry exists between two of the signal alternatives; this means that the criteria corresponding to a given value matrix and a given set of a priori probabilities may be assumed to possess a single degree of freedom, thus permitting a simple display of expected payoff as a function of change in criteria. This problem is taken up again in the context of the forced-choice-in-time experiment discussed below.

An Analysis in Terms of Chi-Square. Several Chi-square tests were applied to the response tables in the detection-and-recognition experiment. The complete numerical results are presented in a forthcoming technical report.




One set of Chi-square tests tested the obtained response tables against those predicted by using the sensitivity characteristics (d'_1, d'_2, d'_12)


obtained in the control experiments and the opti- 
mum criteria. In all cases, a very significant 
difference exists. For certain cases, the bound- 
aries were then shifted to lower the Chi-square, 
still using the equal-variance, Gaussian model 
and the sensitivity characteristics estimated 
from the control experiments. This procedure 
reduced the values of Chi-square considerably. 
Because of the cut-and-try method used, and the 
rather large amount of computation time required 
to obtain each new set of response frequencies, 
no program of analysis was conducted. However, 
enough was done to say that the primary contribu- 
tion to the Chi-square resulted from the assump- 
tion that the criteria assumed by the observers 
were actually the optimum criteria, small changes 
in the location of the criteria (5 of a standard 
deviation) causing large changes in Chi-square and
practically no change in expected payoff. 
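The Chi-square statistic used in such a comparison is the usual sum of squared deviations of obtained from predicted frequencies, each divided by the predicted frequency. A minimal sketch follows; the frequencies are hypothetical, and the choice of degrees of freedom is left aside.

```python
def chi_square(obtained, predicted):
    """Sum of (obtained - predicted)**2 / predicted over all cells."""
    if len(obtained) != len(predicted):
        raise ValueError("tables must have the same number of cells")
    return sum((o - e) ** 2 / e for o, e in zip(obtained, predicted))


if __name__ == "__main__":
    # Hypothetical cell frequencies, read row by row from two 3 x 3 tables.
    predicted = [95, 14, 20, 12, 88, 25, 26, 31, 89]
    obtained = [99, 10, 24, 15, 80, 21, 19, 43, 89]
    print(round(chi_square(obtained, predicted), 2))
```

Cells with small predicted frequencies dominate the sum, which is one reason a modest shift of a boundary can change Chi-square greatly while leaving the expected payoff nearly unchanged.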


The Forced-Choice-in-Time Experiment 


Procedure. In the forced-choice-in-time experiment, two main conditions were employed. In the first, the value associated with correctly accepting each of the three alternatives was constant (V_1 = V_2 = V_3 = 1), whereas the a priori probabilities of occurrence of each of the alternatives were changed from one subcondition to another, from P_1 = P_2 = .445 and P_3 = .11 to P_1 = P_2 = .167 and P_3 = .666. In the second main condition, the situation was the reverse; the probabilities were constant (P_1 = P_2 = P_3 = .333), and the values were changed from V_1 = V_2 = 1 and V_3 = 4 to V_1 = V_2 = 4 and V_3 = 1. In the first main condition, the a priori probabilities were selected to yield w's of 2.00 and .125 (see Equation 16); in the second main condition, two different sets of values were also selected to yield w's of 2.00 and .125. In each of the four subconditions, 400 observations were obtained.


The Data. As in the case of the preceding experiment, the raw data are presented in 3 x 3 tables, in Table V. Again the optimum or predicted response frequencies are listed in parentheses. The data in the first four individual tables were obtained with V_1 = V_2 = V_3 = 1; the data in the last four individual tables were obtained with


[Table V: eight 3 x 3 tables with columns "Presented" and rows "Accepted", one for each observer in each of the four subconditions, giving obtained response frequencies with the optimum frequencies in parentheses. The cell entries are not recoverable from the source.]

TABLE V. The Raw Data from the Forced-Choice-in-Time Experiment


P_1 = P_2 = P_3 = .333. These tables are essentially five-fold tables; the respective numbers of correct responses to the first two alternatives are not distinguished as they were in the experiment reported above. It may be noted, relative to the discussion of the threshold theory above, that none of the cell entries is zero.


The Data in Relation to Detectability Curves. The data of Table V, converted to conditional probabilities, are plotted among the detectability curves specified by the theory, in Figure 10. The number labelling a point indicates the observer from which the point was obtained. The letter P following the number indicates that the point was obtained under a condition in which the a priori probabilities were unequal; the letter V indicates a condition in which unequal values distinguished the signal alternatives. (The consistent difference between the estimates of d' obtained when the probabilities were varied and when the values were varied is tentatively associated with a drift in the meter used to set the noise power.) The data points falling to the left of the line w = .50 were obtained under those conditions characterized by a w = 2.00; the points to the right of this line were obtained when a w = .125 was in effect.


For the sake of comparison, the same points are plotted among the detectability curves specified by the threshold theory, in Figure 11. These curves are traced under the assumption that P_1(A_1) = P_2(A_2) is equal to the p_0 associated with a given strength signal (the "true" detection probability, the probability that the observation will exceed the threshold) plus a chance factor which equals one-half of P_3(A_1 ∪ A_2). The top curve on the graph, p_0 + ½(1 − p_0), is a reflection of the bottom curve, p_0, about the axis P_1(A_1) = P_2(A_2) = .50. It defines the upper limit of P_3(A_1 ∪ A_2), which is presumably equal to 1 − p_0, and of P_1(A_1) = P_2(A_2), for a given strength of signal.


It is immediately apparent from Figures 10 and 11 that the present experiment was not designed to yield data which might be said to fit one set of curves better than the other. There is, however, one basis on which Figures 10 and 11 may be used in a comparison of the present data with the two theories: given that the observer can control P_3(A_1 ∪ A_2), since it shifts appropriately with a change in w, under the threshold theory one would predict that all of the data points would fall either on the line p_0 + ½(1 − p_0) or on the line P_3(A_1 ∪ A_2) = 0.


The Correlation Between Predicted and Obtained Response Tables. The optimum, or predicted, response frequencies for the forced-choice-in-time task are obtained, given the value of d' and w, from the detectability curves of Figures 5 and 6. As just noted, each of the four subconditions yields a point on each of these graphs. The optimum conditional probabilities for a value of d', so determined, and the appropriate value of w, are read directly from the graphs; these conditional probabilities are then converted to response frequencies. Unlike the procedure in the first experiment reported, the optimum response frequencies that are compared with a given set of obtained response frequencies are determined in part by the d' characterizing only that set of obtained response frequencies. In this experiment, in other words, the sensitivity characteristic of the observer is not determined in a separate control experiment and then assumed constant, for purposes of analysis, throughout the experiment.


Table VI presents the coefficients of rank-order correlation between the individual predicted and obtained response tables of Table V. For samples of this size, a coefficient of .80 has an associated probability of .05 and a coefficient of .90 has an associated probability of .01, under the null hypothesis.

[Table VI: columns "Probabilities Variable" and "Values Variable", subdivided by the value of w (the heading w = 2.00 is legible); rows Observer 1 and Observer 2. Only one entry, 1.00 for Observer 1, is legible in the source.]

TABLE VI. The Correlation between Predicted and Obtained Response Frequencies in the Forced-Choice-in-Time Experiment.


The Comparison of Predicted and Obtained Payoffs. As in the case of the previous experiment, the payoff earned by the observer may be compared with the payoff attainable given his demonstrated sensitivity characteristic, and with the payoffs attainable with d' = ∞ and d' = 0. These amounts are listed in Table VII and presented graphically in Figure 12. The amounts are normalized so that the payoff attainable with infinite sensitivity is equal to 1.00. It may be seen in Figure 12 that the largest discrepancy between prediction and data is approximately 5% for Observer 1 and 8% for Observer 2.


The indices of the approach of the obtained 
payoff to the predicted payoff, in terms of the 
ratio of these quantities, are given in Table 
WARES 


By recourse to plots of the expected pay- 
off vs. P3(Ay UA5), it may be stated that, in 
this experiment, the observers are very proba- 
bly not performing as strikingly near optimum 


as indicated by the data displayed in Figure 12 
and Table VIII. The plots of expected payoff 


Observer 
a 


predicted 


obtained 


at 


Observer | a! 


2 


predicted 


obtained 


Probabilities Variable 


Values Variable 


-813 


-817 


(27 


1.00 


-199 


TABLE VII. A Comparison of Various Payoffs in the Forced-Choice-in-Time Experiment. 


[Table VIII: rows Observer 1 and Observer 2; columns Probabilities
Variable and Values Variable. Entries are illegible in the source.]

TABLE VIII.


vs. $P_3(A_1 \cup A_2)$ presented in Figures 13 and 14, for
w = .125 and w = 2.00 respectively, show an almost constant payoff
over a considerable range of $P_3(A_1 \cup A_2)$. For w = .125, the
payoff is essentially constant from $P_3(A_1 \cup A_2)$ = .40 to
.98 for the lowest value of d' obtained, d' = 1.2 (see Figure 10),
and from .10 to .98 for the highest value of d' obtained
(d' = 1.8). It should also be noted, however, in Figure 10, that




Values Variable 


The Ratio of Obtained to Predicted Payoff in the Forced-Choice-in-Time Experiment 


the observers are not making full use of this latitude. The
discrepancies between the obtained estimate of $P_3(A_1 \cup A_2)$
and the optimum $P_3(A_1 \cup A_2)$ for w = .125 are .18 and .15
for Observer 1 and .26 and .16 for Observer 2. For w = 2.00, the
payoff curve is fairly flat from $P_3(A_1 \cup A_2)$ = .01 to .20.
The deviations of obtained from optimum $P_3(A_1 \cup A_2)$ are .04
and .01 for Observer 1 and .07 and .04 for Observer 2. This
analysis, however, does make it clear that one of the more
appropriate measures of congruence of obtained and optimum
performance may be quite insensitive, unless the payoff scheme is
carefully constructed.


Summary and Conclusions 


The concern in this paper is limited to a particular type of
detection problem, that involving a fixed, finite observation
interval. The observation made by the observer is assumed to be
mapped onto a small-dimensional space for purposes of decision;
the observation made in a two-hypothesis test is mapped onto a
one-dimensional space, and the observation in a three-hypothesis
test is mapped onto a two-dimensional space. In the theory of
optimum detection, such a mapping is a monotone function of the
likelihood ratios of the observation. In the application of this
theory to the human observer, such a mapping is considered fixed
but not necessarily optimum; however, the subdivision of the space
into criteria will be related to the likelihood ratios on this
reduced observation space. As a matter of convenience, it is
assumed that the noise distribution in the reduced observation
space is normalized Gaussian; this is not a restriction of
generality, but a degree of freedom at the analyst's disposal.


In detectability theory, many related definitions of "optimum"
are used. A single one of these, the expected-value definition, is
considered in this paper. The observer is informed of the existing
a priori probabilities of all events and the value associated with
each possible outcome of a decision. His ability to adjust the
decision criteria to maximize the payoff is studied. In the
experiments reported, the value of an incorrect identification was
zero, and that of a correct response, positive. This scheme was
employed for two reasons. The first was to simplify the
presentation of the task to the observer. Perhaps more important
is that this scheme makes each decision function linear with one
log likelihood ratio, and under the condition of a normalized
Gaussian noise, the usual distance in the space is proportional to
log likelihood ratio in a manner independent of the detectability
indices. This simplifies the model and makes computation
comparatively simple. A geometric model using likelihood ratios
directly as axes will also have straight-line boundaries in all
cases of value assignments (see Fig. 1 of Reference 15), but the
distributions in such a case make computation difficult.
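The proportionality between distance and log likelihood ratio can be checked numerically. The sketch below assumes the one-dimensional case only: noise distributed as a unit-variance Gaussian with mean 0 and signal-plus-noise as the same distribution translated by d'. The function names are illustrative and do not come from the paper.

```python
import math
from statistics import NormalDist


def log_likelihood_ratio(x, d_prime):
    """log of p(x | signal-plus-noise) / p(x | noise) for unit-variance Gaussians."""
    noise = NormalDist(0.0, 1.0)
    signal_plus_noise = NormalDist(d_prime, 1.0)
    return math.log(signal_plus_noise.pdf(x) / noise.pdf(x))


def linear_form(x, d_prime):
    """The same quantity written as a linear function of the observation x."""
    return d_prime * x - 0.5 * d_prime ** 2


# The two expressions agree at any point on the decision axis.
difference = log_likelihood_ratio(0.7, 1.5) - linear_form(0.7, 1.5)
```

Because the log likelihood ratio is linear in x, a criterion stated on the likelihood ratio corresponds to a single cut on the decision axis.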


Two types of the three-hypothesis test are studied. A
detection-and-recognition experiment involving the hypotheses
Signal One, Signal Two, and Noise Alone, is related to previous
work (10) which treated the hypotheses pairwise. In the present
paper, the hypotheses were treated pairwise to obtain the
configuration of the reduced observation space; this
configuration, together with the optimum placement of boundaries
of criteria, permits one to make predictions. The
forced-choice-in-time experiment reported is the first one to
employ an unbalance among the signal hypotheses, thus demanding
study of the role of criteria adjustment. In this experiment, one
degree of symmetry among the hypotheses was retained in order to
determine the sharpness of the optimum, and hence the relative
sensitivity of the experiment to the task assigned the observer,
that of maximizing the expected value.


Two methods of manual computation of the 
predicted response frequencies are presented. In 
these methods, particular emphasis is placed upon 
the numerical specification of the distance of the 
optimum boundaries from the hypotheses means as 
functions of the values, a priori probabilities, 
and the detectability index; and upon a geometri- 
cal model, showing the hypothesis configuration 
and the placement of the boundaries. 


The exact experimental procedure is reviewed. The data from two
observers are tabulated in response-frequency tables, together
with predictions based on measured sensitivities and optimum
criteria. The correlation between prediction and data is shown to
be highly significant. The expected value is compared with the
payoffs obtained by the observer; the discrepancy is found to be
very small percentagewise. Some response-frequency tables were
examined using a Chi-square test; this test indicates that the
criteria assumed by the observers are definitely not the optimum
criteria. However, in view of the fact that the expected value,
which constituted the observers' motivation, is not sharply
distributed, the Chi-square test may be regarded as overly
sensitive.


In a study of this sort, it is appealing to use a probability
model that is a reasonable picture of the physical situation
represented by the experiment. In the present case, then, an
attractive model would be one involving representation of a pulse
signal with unknown carrier phase and some uncertainty with
respect to starting time and duration, and possibly of
frequency-scanning characteristics. Computation of predicted
response-frequency tables with asymmetric criteria under such a
model is extremely difficult. Studies designed to explore the
applicability of this more complex model are being conducted, but,
in general, they avoid the problem studied here, that of criteria
adjustment, by relying on obvious physical symmetry to force
symmetric criteria, and by using only the average correct score as
the measurement. For the purposes of criteria-adjustment studies
at these levels of noise (d' of 1.0 to 2.0), the extremely simple
model, in which the signal-plus-noise distribution is a translate
of the distribution of noise alone, and the signals are at
measurable angle with each other, proves to be an adequate model
for prediction of expected payoff and of response-frequency tables
to within a few percent.


It may be worthwhile to emphasize that the congruence of
predictions and data in a normative study like this one has a dual
significance. Whereas the concern in theoretically-oriented
experimental activity lies usually with the matching of theory to
observed behavior whatever it may be, in this sort of study an
interest in the subject's ability to match the predictions that
follow from the theory achieves equivalent status. The present
study demonstrates that a simple form of the decision-making model
permits prediction of the detection and recognition behavior of
the human observer in fairly complex situations and, in addition,
indicates the extent to which intelligence may influence a process
usually assumed to involve primarily fixed parameters.


References 


1. Birdsall, T. G., "The Theory of Signal Detectability," in
Quastler, H. (ed.), Information Theory in Psychology. Glencoe,
Ill.: Free Press, 1955.

2. Kendall, M. G., The Advanced Theory of Statistics. London:
J. B. Lippincott Co., 1943.

3. Peterson, W. W. and Birdsall, T. G., "Theory of Signal
Detectability," Part I and II, Tech. Rpt. No. 13, Electronic
Defense Group, University of Michigan, 1953. Also available in
1954 Symposium on Information Theory, Transactions of the IRE,
PGIT-4, September, 1954.

4. Roberts, G. A., "An Automatic Random Programmer," Department
of Electrical Engineering, Engineering Research Institute,
Electronic Defense Group, University of Michigan.

5. Shannon, C. E. and Weaver, W., The Mathematical Theory of
Communications. University of Illinois Press, 1949.

6. Swets, J. A., The Influence of Various Amounts of Prior
Information on Decision-Making. Technical Report, Electronic
Defense Group, University of Michigan (in preparation).

7. Swets, J. A. and Birdsall, T. G., "Decision-Making in
Detection and Recognition Situations Involving Multiple
Alternatives." Technical Report, Electronic Defense Group,
University of Michigan (in preparation).

8. Swets, J. A., Tanner, W. P., Jr., and Birdsall, T. G., "The
Evidence for a Decision-Making Theory of Visual Detection,"
Tech. Rpt. No. 40, Electronic Defense Group, University of
Michigan, 1955.

9. Tanner, W. P., Jr., "On the Design of Psychophysical
Experiments," in Quastler, H. (ed.), Information Theory in
Psychology. Glencoe, Ill.: Free Press, 1955.

10. Tanner, W. P., Jr., "A Theory of Recognition," Technical
Report No. 50, Electronic Defense Group, University of Michigan,
1955.

11. Tanner, W. P., Jr., and Swets, J. A., "The Human Use of
Information: I. Signal Detection for the Case of the
Signal-Known-Exactly," Transactions of the IRE, PGIT-4,
September, 1954.

12. Tanner, W. P., Jr., Swets, J. A., "A Decision-Making Theory
of Visual Detection," Psychol. Rev., Vol. 61, No. 6, 1954.

13. Tanner, W. P., Jr., Swets, J. A., and Green, D. M., "Some
General Properties of the Hearing Mechanism," Tech. Rpt. No. 30,
Electronic Defense Group, University of Michigan, 1956.

14. Van Meter, D. and Middleton, D., "Modern Statistical
Approaches to Reception in Communication Theory," 1954 Symposium
on Information Theory, Transactions of IRE, PGIT-4, September,
1954.

15. Van Meter, D. and Middleton, D., "On Optimum
Multiple-Alternative Detection of Signals in Noise," IRE
Transactions on Information Theory, Vol. IT-1, September, 1955.


Appendix 


In order to predict the 3 x 3 tables, the probabilities of the
criteria in a 2-dimensional normal distribution must be computed.
This was done by a graphical counting technique. First it is
observed that each square on double-probability paper,* ruled in
one-percent steps, has a probability of $10^{-4}$ for a
2-dimensional normal distribution centered at the 50 percent-50
percent intersection, with the proper standard deviation.
Therefore a three-hypothesis triangle is drawn to scale with any
convenient orientation as in Fig. 15.


The position of the boundaries for a given three-hypothesis
triangle is determined from Equations 13 and 14. Then the triangle
and boundaries are drawn on three double-probability grids, one
for each mean at the center of the paper.


Figures 16 through 18 display these graphs drawn 
on paper with squares of probability 10°°, for 
demonstration and not for computation. 


This computational method was adopted because only certain
specific tables were desired. For the case of equally-difficult,
orthogonal signals, i.e., for $d'_1 = d'_2 = \sqrt{0.5}\, d'_{12}$,
tables were computed which allowed interpolation to [illegible]
percent accuracy, and the nine-fold tables could be obtained
easily from these. The tables, of course, were derived from
computation of double-probability grids.
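The counting technique can be imitated numerically. The sketch below is an illustration under stated assumptions, not the authors' procedure: each square of the one-percent grid is represented by its centre, the grid is centred on the mean of the presented hypothesis, and each square is assigned to the nearest hypothesis mean, which stands in for the optimum boundaries only when a priori probabilities and values are equal (Equations 13 and 14 are not reproduced here). The hypothesis means and the function name are hypothetical.

```python
from statistics import NormalDist


def response_probabilities(means, presented, steps=100):
    """Count equal-probability squares falling in each response region.

    means     -- list of (x, y) hypothesis means in the decision space
    presented -- index of the hypothesis actually presented
    steps     -- squares per axis; 100 gives squares of probability 1e-4
    """
    unit = NormalDist()
    centre_x, centre_y = means[presented]
    counts = [0] * len(means)
    for i in range(steps):
        x = centre_x + unit.inv_cdf((i + 0.5) / steps)
        for j in range(steps):
            y = centre_y + unit.inv_cdf((j + 0.5) / steps)
            distances = [(x - mx) ** 2 + (y - my) ** 2 for mx, my in means]
            counts[distances.index(min(distances))] += 1
    return [count / steps ** 2 for count in counts]


# Hypothetical orthogonal, equally difficult signals with d' = 2.
means = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]  # noise, signal one, signal two
row_noise = response_probabilities(means, presented=0)
```

Calling the function once per presented hypothesis yields the three rows of a 3 x 3 response table.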


* Codex Graph Sheet No. 4253. The abscissa and ordinate scales are
each $\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\,dx$,
where t is a linear scale.


[Figure residue: a caption fragment referring to noise and
signal-plus-noise; the rest is illegible in the source.]


Fig. 2 - $P_{SN}(A)$ vs $P_N(A)$ with d' as the parameter.

Fig. 3 - A geometrical model of the detection-and-recognition task.

Fig. 4 - P(c) vs d' for the three-alternative, forced-choice task.

Fig. 5 - $P_1(A_1)$ vs $P_3(A_1 \cup A_2)$ with d' and w as parameters.

Fig. 6 - $P_1(A_1)$ vs $P_1(A_2)$ with d' and w as parameters.


Fig. 7 - A geometrical model of the forced-choice-in-time task, A.

Fig. 8 - A geometrical model of the forced-choice-in-time task, B.

Fig. 9 - Various payoffs as a function of the value matrix.
(Panels: Observer 1, Observer 2. Ordinate: payoff. Abscissa: value
condition.)


Fig. 10 - The data compared with the detectability curves specified
by the decision-making theory. (Ordinate: $P_1(A_1) = P_2(A_2)$.
Abscissa: $1 - P_3(A_3) = P_3(A_1 \cup A_2)$.)

Fig. 11 - The data compared with the detectability curves specified
by threshold theory.


[Fig. 12: caption not recoverable from the source. Panels:
Probabilities Variable, Values Variable. Ordinate: payoff.
Conditions: w = .125 and w = 2.00.]

Fig. 13 - The expected payoff vs $P_3(A_1 \cup A_2)$ for w = .125.

Fig. 14 - The expected payoff vs $P_3(A_1 \cup A_2)$ for w = 2.00.


Fig. 15 - The geometrical model of the detection-and-recognition
task. (Vertices: mean of N distribution, mean of S1N distribution,
mean of S2N distribution. The sides are labelled with d' values of
1.45, 1.84, and 1.77.)

Fig. 16 - Noise alone presented.

Fig. 17 - Signal one plus noise presented.

Fig. 18 - Signal two plus noise presented.

(Figs. 16 through 18 each mark the points Noise Alone, Signal One,
and Signal Two.)


ON OPTIMUM NON-LINEAR EXTRACTION AND CODING FILTERS


A. V. Balakrishnan* and R. Drenick
GENERAL ENGINEERING DEVELOPMENT
RADIO CORPORATION OF AMERICA
CAMDEN, NEW JERSEY


SUMMARY


The problem of determining optimal non-linear least-square filters
is solved for a class of stationary time series. This theory is
then used as the basis for developing a band-width reduction scheme
using non-linear encoding and decoding filters, for the same class
of signals. A simple illustrative example is included.


This paper consists of two parts. In Part I we consider the
problem of determining optimum non-linear (least-squares)
extraction filters for a class of stationary time series. The
results obtained in Part I provide the basis for a
bandwidth-compression coding scheme which is discussed in Part II.


For the sake of simplicity, discrete parameter processes are
considered.


1. OPTIMUM NON-LINEAR EXTRACTION FILTERS


Let $\{s_n\}$ be a strictly stationary** discrete parameter
stochastic process identified as the "signal" and let $\{N_n\}$ be
a second similar process, statistically independent of the first,
identified as "noise". Let

$$x_n = s_n + N_n$$

Then $\{x_n\}$ is also a (strictly) stationary process. Let F be
any translation-invariant operator on the past of $\{x_n\}$ so that

$$y_n = F[x_m,\ m \le n] \qquad (1)$$

The central problem with which we are concerned in this part of the
paper is that of determining the form of the optimal F which
minimizes the mean square error:

$$E[(s_n - y_n)^2]$$

where E[ ] denotes the expected value. [If, as is usual, ergodicity
is assumed, the phase averages may be replaced by time averages.]


Now it is well known that if the optimal estimate is denoted by
$s_n^*$,

$$s_n^* = E[s_n \mid x_m,\ m \le n] \qquad (2)$$

which thereby determines the optimal F as well. Further, the
actual minimal mean square error itself is given by:

$$\text{minimal error} = E[(s_n)^2] - E[(s_n^*)^2] \qquad (3)$$


* Now with RCA, West Coast Engineering, Los Angeles.

** For the terminology used see reference 1.


The problem, however, is that of determining the optimal operator
given by (2) in closed functional form. When F is restricted to be
linear, complete solution is possible and, if in addition
ergodicity is assumed, can be obtained by working entirely with
time averages, as Wiener does in his classic work. With the
linearity restriction removed, however, the complexity of the
problem increases considerably, in common with most non-linear
problems. Progress can nevertheless be made if, as a first step,
the class of processes considered is suitably restricted, while
still retaining engineering usefulness. Such a restriction we
consider below.


It is well known that any Gaussian process can be derived from
white Gaussian noise [shot noise] by a linear filter. Extending
this, we assume that the structure of the signal and noise
processes to be considered in this paper is such that they can
both be derived from stationary "pure white" (but not necessarily
Gaussian) primary processes by linear filters acting only on the
past. We further assume, for the purposes of this paper, that the
linear filters differ only by a multiplicative constant. Under
these conditions we have the representation:

$$s_n = \sum_{k=0}^{\infty} a_k \zeta_{n-k}, \qquad N_n = \sum_{k=0}^{\infty} a_k \eta_{n-k} \qquad (4)$$

where we may further take without loss in generality

$$\sum_{k=0}^{\infty} a_k^2 = 1$$

Since $\{s_n\}$, $\{N_n\}$ are statistically independent, so are
$\{\zeta_n\}$ and $\{\eta_n\}$, besides being pure white processes.
A consequence of this representation is, as we shall see, that the
optimal linear filter is reduced to a multiplicative constant.


We note that if we let

$$z_n = \zeta_n + \eta_n$$

$\{z_n\}$ is also a pure white process, and further we have the
representation

$$x_n = \sum_{k=0}^{\infty} a_k z_{n-k} \qquad (5)$$

We next make the assumption that the transformation in (5) has an
inverse. We have then the reciprocal relationship:

$$z_n = \sum_{k=0}^{\infty} b_k x_{n-k} \qquad (6)$$


In order to obtain the functional form of the optimal operator, we
begin by using the basic result (2)

$$s_n^* = E[s_n \mid x_m,\ m \le n]$$

To avoid index trouble, we now use the stationarity property.
(Indeed, it may be noted that our representation (4) makes the
processes ergodic.) Then it suffices to find*

$$s_0^* = E[s_0 \mid x_m,\ m \le 0]$$

If we now use the fact that the transformation (5) is one-one, we
have:

$$s_0^* = E[s_0 \mid z_n,\ n \le 0]$$

the $\{z_n\}$ being determined in terms of the $\{x_n\}$ by
relation (6). Again, since

$$s_0 = \sum_{k=0}^{\infty} a_k \zeta_{-k}$$

it follows that

$$E[s_0 \mid z_n,\ n \le 0] = \sum_{k=0}^{\infty} a_k E[\zeta_{-k} \mid z_n,\ n \le 0]$$

the convergence being in the mean and with probability one. Now

$$E[\zeta_{-k} \mid z_m,\ m \le 0] = E[\zeta_{-k} \mid z_{-k}]$$

since $\zeta_k$ is independent of $z_m$ for $m \ne k$. Thus we
have that:

$$s_0^* = \sum_{k=0}^{\infty} a_k E[\zeta_{-k} \mid z_{-k}] \qquad (7)$$

Again, stationarity permits us to write

$$E[\zeta_k \mid z_k] = f(z_k) \qquad (8)$$

so that we have finally,

$$s_0^* = \sum_{k=0}^{\infty} a_k f(z_{-k}) \qquad (9)$$

and

$$s_n^* = \sum_{k=0}^{\infty} a_k f(z_{n-k}) \qquad (10)$$

* The convergence properties involved here may be derived, for
instance, from the Martingale theory of Doob, reference 1.


At this point we can block-diagram the non-linear filter
represented by (10). Thus in Figure 1, the optimal filter consists
of 3 sections, a zero-memory device sandwiched between two
mutually reciprocal linear filters. The $\{x_n\}$ process is first
equalized to have a flat spectrum, that is, to yield $\{z_n\}$.
The zero-memory device has transfer characteristic f(.). The third
filter is the reciprocal of the first and restores the spectrum to
its original shape. It may be noted that filter memory is confined
to the linear sections.
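The three-section structure can be sketched for finite data. The code below is an illustration under stated assumptions rather than the authors' implementation: the shaping filter has finitely many coefficients a_k with a_0 nonzero, the input is taken to be zero before the first sample, the inverse-filter coefficients are labelled b_k (a label chosen here), and the zero-memory characteristic f is passed in as an ordinary function. All function names are hypothetical.

```python
def inverse_coefficients(a, length):
    """Coefficients b_k of the causal inverse of the filter a_k.

    Solves sum_j a_j * b_(k-j) = 1 for k = 0 and 0 for k > 0.
    """
    b = [1.0 / a[0]]
    for k in range(1, length):
        acc = sum(a[j] * b[k - j] for j in range(1, min(k, len(a) - 1) + 1))
        b.append(-acc / a[0])
    return b


def causal_filter(coeffs, x):
    """y_n = sum_k coeffs[k] * x_(n-k), with x taken as zero before n = 0."""
    return [
        sum(c * x[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(x))
    ]


def extraction_filter(x, a, f):
    """Equalizer, zero-memory device f, then the reciprocal shaping filter."""
    b = inverse_coefficients(a, len(x))
    z = causal_filter(b, x)        # flat-spectrum (equalized) sequence
    w = [f(v) for v in z]          # memoryless non-linearity
    return causal_filter(a, w)     # restore the original spectral shape


# With a linear f the whole cascade collapses to a multiplicative constant.
x = [1.0, -2.0, 0.5, 3.0]
y = extraction_filter(x, a=[1.0, 0.5], f=lambda v: 0.5 * v)
```

The last two lines illustrate the remark that a linear zero-memory characteristic reduces the filter to a gain.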


Synthesis of the Zero-Memory Filter


From (10) it is seen that the major problem in the synthesis of
the optimal filter is the determination of the form of the
function f( ). [It will be noted that determining f( ) is
equivalent to solving the optimisation problem for pure white
processes.] Actually the problem here is two-fold: one is to
obtain f( ) in closed form and the second is to obtain a good
approximation for it suitable for physical synthesis.


The first step in determining f(.) is of course to specify the
distributions of $\zeta_k$ and $z_k$. Here it has been found most
convenient to assume that they can be approximated by taking a
finite number of non-zero coefficients in a Gram-Charlier
expansion. Of course, other representations are possible, but here
again, our preference is guided by the needs in Part II. Before we
use the Gram-Charlier representation, we note that

$$f(z_k) = E[\zeta_k \mid z_k] = \frac{\int \zeta\, P_\zeta(\zeta)\, P_\eta(z_k - \zeta)\, d\zeta}{\int P_\zeta(\zeta)\, P_\eta(z_k - \zeta)\, d\zeta} \qquad (11)$$

To simplify (11) further, it is somewhat easier to work with
characteristic functions. Let

$$C_\zeta(t) = \int \exp(it\zeta)\, P_\zeta(\zeta)\, d\zeta$$

$$C_\eta(t) = \int \exp(it\eta)\, P_\eta(\eta)\, d\eta$$

Then it follows that

$$f(z_k) = \frac{\int \exp(-i t z_k) \left[\frac{1}{i} \frac{d}{dt} C_\zeta(t)\right] C_\eta(t)\, dt}{\int \exp(-i t z_k)\, C_\zeta(t)\, C_\eta(t)\, dt} \qquad (12)$$

using independence of the $\zeta_k$ and $\eta_k$ processes and
simple properties of convolutions.
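The conditional mean E[ζ | z], written as a ratio of two convolution integrals over the densities of ζ and η, can be evaluated by direct numerical integration. The sketch below is an illustration only: it uses a trapezoidal rule on a truncated range, and the two example densities (a unit-variance Gaussian and a unit-variance double-exponential) are assumptions chosen for the example, not distributions taken from the paper.

```python
import math


def gaussian_density(variance):
    """Zero-mean Gaussian density with the given variance."""
    scale = math.sqrt(2.0 * math.pi * variance)
    return lambda u: math.exp(-u * u / (2.0 * variance)) / scale


def double_exponential_density(variance):
    """Zero-mean double-exponential (Laplace) density with the given variance."""
    rate = math.sqrt(2.0 / variance)
    return lambda u: 0.5 * rate * math.exp(-rate * abs(u))


def conditional_mean(z, p_signal, p_noise, lo=-12.0, hi=12.0, steps=4800):
    """Trapezoidal evaluation of E[signal | signal + noise = z]."""
    h = (hi - lo) / steps
    numerator = denominator = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        end_weight = 0.5 if i in (0, steps) else 1.0
        weight = end_weight * p_signal(u) * p_noise(z - u)
        numerator += u * weight
        denominator += weight
    return numerator / denominator


# Gaussian signal in Gaussian noise of equal power: f(z) is linear, z / 2.
f_gauss = conditional_mean(1.0, gaussian_density(1.0), gaussian_density(1.0))

# Non-Gaussian signal in Gaussian noise: f(z) is no longer proportional to z.
f_small = conditional_mean(0.5, double_exponential_density(1.0), gaussian_density(1.0))
f_large = conditional_mean(3.0, double_exponential_density(1.0), gaussian_density(1.0))
```

Comparing f_small / 0.5 with f_large / 3.0 shows how far the characteristic departs from a straight line in the non-Gaussian case.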


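The use of characteristic functions makes it particularly easy to
work with Gram-Charlier expansions. For simplicity we assume zero
means and let

$$C_\zeta(t) = \left[1 + \sum_{n=3}^{N} \frac{\alpha_n}{n!} (it)^n\right] \exp\left(-\frac{\sigma_s^2 t^2}{2}\right)$$

$$C_\eta(t) = \left[1 + \sum_{n=3}^{N} \frac{\beta_n}{n!} (it)^n\right] \exp\left(-\frac{\sigma_N^2 t^2}{2}\right) \qquad (13)$$

where $\sigma_s^2$, $\sigma_N^2$ are the variances of $\zeta_k$ and
$\eta_k$ (or of $s_n$ and $N_n$) respectively. Substituting (13)
into (12) we have: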


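$$f(z_k) = \frac{\frac{1}{i} \int \frac{d}{dt}\left\{\left[1 + \sum_{n=3}^{N} \frac{\alpha_n}{n!} (it)^n\right] \exp\left(-\frac{\sigma_s^2 t^2}{2}\right)\right\} \left[1 + \sum_{n=3}^{N} \frac{\beta_n}{n!} (it)^n\right] \exp\left(-\frac{\sigma_N^2 t^2}{2}\right) \exp(-i z_k t)\, dt}{\int \left[1 + \sum_{n=3}^{N} \frac{\alpha_n}{n!} (it)^n\right] \left[1 + \sum_{n=3}^{N} \frac{\beta_n}{n!} (it)^n\right] \exp\left(-\frac{(\sigma_s^2 + \sigma_N^2) t^2}{2}\right) \exp(-i z_k t)\, dt} \qquad (14)$$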


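Relation (14) can be further simplified in two ways, both useful.
In the first, and obvious way, we let

$$H_n(z) = \frac{\sqrt{\sigma_s^2 + \sigma_N^2}}{\sqrt{2\pi}} \exp\left[\frac{z^2}{2(\sigma_s^2 + \sigma_N^2)}\right] \int (it)^n \exp(-izt) \exp\left[-\frac{(\sigma_s^2 + \sigma_N^2) t^2}{2}\right] dt = (-1)^n \exp\left[\frac{z^2}{2(\sigma_s^2 + \sigma_N^2)}\right] \frac{d^n}{dz^n} \exp\left[-\frac{z^2}{2(\sigma_s^2 + \sigma_N^2)}\right]$$

where $H_n(z)$ is the Hermite polynomial of degree n associated
with the Gaussian of variance $(\sigma_s^2 + \sigma_N^2)$.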


Then f(z) can be expressed as a ratio of these Hermite polynomials
as

$$f(z_k) = \frac{\sum_{n=1}^{2N+1} \frac{\delta_n}{n!} H_n(z_k)}{1 + \sum_{n=3}^{2N} \frac{\gamma_n}{n!} H_n(z_k)} \qquad (15)$$

where the coefficients $\delta_n$ and $\gamma_n$ are obtained by
inspection from (14). In particular, we note that the coefficients
$\gamma_n$ in the denominator are the Gram-Charlier coefficients in
the expansion of the distribution of $z_k$.


Unfortunately the denominator in (15) makes mechanization
difficult and further is not sufficiently suggestive. For example,
it does not provide any answer to a question such as: What is the
best nth order approximation to the filter? For this purpose, we
note first from (11) that f(z) is of the form

$$f(z) = \frac{g(z)}{P(z)}$$

where the denominator is the distribution of $z_k$. Let
$\{P_n(z)\}$ be the sequence of polynomials orthonormal with
respect to P(z), so that

$$\int P_n(z)\, P_m(z)\, P(z)\, dz = \delta_{mn}$$

Then f(z) can be expanded in terms of these polynomials so that

$$f(z) = \sum_{n=0}^{\infty} c_n P_n(z) \qquad (16)$$

where

$$c_n = \int P_n(z)\, g(z)\, dz \qquad (17)$$

The convergence of (16) is both in the mean and with probability
one, and further, $\sum_{k=0}^{n} c_k P_k(z)$ is the best
approximation to f(z) in the mean by an nth order polynomial.

Moreover, substituting (16) into (10), we obtain

$$s_n^* = \sum_{k=0}^{\infty} a_k \left[\sum_{l=0}^{\infty} c_l P_l(z_{n-k})\right] \qquad (18)$$

$$s_n^* = \sum_{l=0}^{\infty} c_l \left[\sum_{k=0}^{\infty} a_k P_l(z_{n-k})\right] \qquad (19)$$

where in the second form the partial sum through the lth term of
the series yields the best lth order approximation to the filter.
(It is interesting to note parenthetically that the filter given
by (10) is already of class $N_\infty$, in general, in the
classification scheme of Zadeh, while (19) explicitly provides a
"power series" expansion for it.)


The minimal error given by (3) can again be expressed using (18)
as:

$$\text{minimal error} = \left[\sum_{k=0}^{\infty} a_k^2\right] \left[\sigma_s^2 - \sum_{l=0}^{\infty} c_l^2\right] = \sigma_s^2 - \sum_{l=0}^{\infty} c_l^2 \qquad (20)$$

Relation (20) makes clear how the error decreases as higher-order
approximations are used.


The orthonormal polynomials $P_n(z)$ can be explicitly obtained in
terms of the moments of P(z) as given for instance by Cramer,
reference 3, p. 132. In computing the coefficients, it is
convenient to note that:

$$\int z^n g(z)\, dz = (-i)^{n+1} \left[\frac{d^n}{dt^n}\left\{C_\zeta'(t)\, C_\eta(t)\right\}\right]_{t=0} \qquad (21)$$


In particular, the first two polynomials can be easily seen to be:

$$P_0(z) = 1, \qquad P_1(z) = \frac{z}{\sqrt{\sigma_s^2 + \sigma_N^2}}$$

and

$$c_0 = 0, \qquad c_1 = \frac{\sigma_s^2}{\sqrt{\sigma_s^2 + \sigma_N^2}}$$

so that the best linear filter is given by

$$\frac{\sigma_s^2}{\sigma_s^2 + \sigma_N^2} \sum_{k=0}^{\infty} a_k z_{n-k} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_N^2}\, x_n$$

with corresponding error

$$\sigma_s^2 - \frac{\sigma_s^4}{\sigma_s^2 + \sigma_N^2} = \frac{\sigma_s^2 \sigma_N^2}{\sigma_s^2 + \sigma_N^2}$$

The best linear filter is thus reduced to a multiplicative
constant. The computation of higher order terms in (18) in this
generality is as laborious as it is unnecessary, and the results
for a specific example can be found in Part II.
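As a small numerical check of the linear case, the sketch below evaluates the gain and the error from the signal and noise variances, and confirms that the error equals the signal variance minus the square of the first expansion coefficient. The function names are illustrative assumptions.

```python
def linear_gain(var_signal, var_noise):
    """Multiplicative constant of the best linear filter."""
    return var_signal / (var_signal + var_noise)


def first_coefficient(var_signal, var_noise):
    """Coefficient c_1 of the first-degree orthonormal polynomial."""
    return var_signal / (var_signal + var_noise) ** 0.5


def linear_error(var_signal, var_noise):
    """Mean square error of the best linear filter."""
    return var_signal * var_noise / (var_signal + var_noise)


# Equal unit powers: gain one-half and error one-half.
gain = linear_gain(1.0, 1.0)
error = linear_error(1.0, 1.0)
```

With equal unit powers this reproduces the gain of one-half and the error of one-half quoted for two uncoded signals in Part II.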


2. OPTIMUM CODING AND DECODING FILTERS


The optimal non-linear filter theory outlined in Part I will now
be used to develop a bandwidth-reduction coding scheme for the
same class of signals, which is relatively simple to implement.
To be specific, we consider two statistically independent signals
with identical statistics of all orders (such as, for instance,
two independent long samples drawn from the same population), and
show how these may be transmitted so as to occupy only as much
frequency band as required by each signal individually, and
received subject to any specified degree of fidelity. The measure
of fidelity adopted here, in line with Part I, is the mean square
error.


A block-schematic of the system is given in Figure 2. Each signal
is coded by an encoding filter at the transmitter prior to
transmission over the same channel. The transmitted signal is thus
the algebraic sum of the coded signals. At the receiver the
operations of extraction of coded signals from the mixture and
decoding to obtain the original signals are performed. The
detailed structure of the various blocks is discussed below, but
briefly, the coders serve to impart prescribed statistical
characteristics to the signals so as to make the extraction as
good as possible, while the decoders restore the original
statistics of the individual signals, within the specified
fidelity limits.


Let $\{s_n^1\}$, $\{s_n^2\}$ be the two signals, the superscripts
identifying the signal, both of which are assumed to be
representable as in (4) by

$$s_n^1 = \sum_{k=0}^{\infty} a_k \zeta^1_{n-k}, \qquad s_n^2 = \sum_{k=0}^{\infty} a_k \zeta^2_{n-k} \qquad (22)$$

where

$$\sum_{k=0}^{\infty} a_k^2 = 1$$

and

$$E[(\zeta^1_n)^2] = E[(\zeta^2_n)^2] = 1$$

The two signals are assumed to have equal power and unit power is
assumed for convenience. Identical statistics are obtained by
taking all moments of $\zeta^1_n$ and $\zeta^2_n$ equal. If these
signals were transmitted along the same channel without coding,
the optimum least-squares filter would be linear, in fact would
reduce to a multiplicative constant equal to one-half, leading to
a mean square error of one-half.


The purpose of coding is to impart such characteristics to the
signals as to improve the fidelity of reception. From our
discussion in Part I we see that this may be done by non-linear
filters which change the characteristics of the primary processes.
Accordingly, we first equalize the signals $\{s_n^1\}$ and
$\{s_n^2\}$ to obtain the $\{\zeta_n^1\}$ and $\{\zeta_n^2\}$
processes, which it is possible to do for the class of signals we
are considering. A simple zero-memory non-linear "encoding" device
is used to change the first-order moments of the equalized signals
in a way to be described presently. Finally the original spectrum
is restored before transmission. Thus the coding filter
illustrated in Figure 3 consists of 3 sections: an equalizer, a
zero-memory device and a linear filter that restores the shape of
the spectrum. The problem then is that of determining the
characteristics of the zero-memory device.


Let us for the moment consider one of the signals, say
$\{s_n^1\}$. Let the density of the equalized or primary process
$\{\zeta_n^1\}$ be expanded into a Gram-Charlier series:

$$P_{\zeta^1}(z) = \left[1 + \sum_{n=3}^{\infty} \frac{\kappa_n}{n!} H_n(z)\right] G(z) \qquad (23)$$

where G(z) is a Gaussian with unit variance and $H_n(z)$ is the
Hermite polynomial of order n associated with G(z).

Since no change in power is desirable as a result of coding, we
can represent the density of the transformed variable

$$\zeta_n = g_1(\zeta_n^1)$$

with corresponding inverse transformation,

$$\zeta_n^1 = h_1(\zeta_n)$$

as

$$P_\zeta(z) = \left[1 + \sum_{n=3}^{\infty} \frac{\alpha_n}{n!} H_n(z)\right] G(z) \qquad (24)$$

where the $\alpha_n$ are still to be specified. But assuming for
the moment that they are, the non-linear coder characteristic can
be determined with the aid of (23) and (24) by the differential
equation

$$G(y) \left[1 + \sum_{n=3}^{\infty} \frac{\alpha_n}{n!} H_n(y)\right] \frac{dy}{dx} = G(x) \left[1 + \sum_{n=3}^{\infty} \frac{\kappa_n}{n!} H_n(x)\right] \qquad (25)$$

where

$$y = g_1(x)$$

Solving this is routine and to avoid notational complexity we
shall do this subsequently for a specific example.


The second signal is coded in a similar way, letting

$$\eta_n = g_2(\zeta_n^2)$$

so as to have

$$P_\eta(z) = \left[1 + \sum_{n=3}^{\infty} \frac{\beta_n}{n!} H_n(z)\right] G(z) \qquad (26)$$

The $\beta_n$'s here are yet to be specified.

The design philosophy for determining the optimal $\alpha_n$'s and
$\beta_n$'s can take several forms. For the purposes of this
paper, we have used the following: the $\alpha_n$'s and
$\beta_n$'s are so chosen as to make the minimal error after
extraction by the optimal filter (determined in Part I) as close
to zero as possible. Since this means that the coded signals are
recovered with arbitrarily small error by the extractor, the
decoders are determined merely as the operational inverse of the
coders. These considerations thus yield an overall synthesis
method for the coding scheme.


A Simple Example—Gaussian Signals 


We now abandon the level of generality of the preceding
discussion and illustrate our ideas with a specific, though
simple, example. The purpose is not so much to exhibit a finished
system but rather to give a picture of what the operations look
like in a simple case.

We assume that the signals $\{s_n^1\}$ and $\{s_n^2\}$ are
Gaussian to begin with, so that the representations (4) and (6)
are valid without any additional assumptions. We shall further
confine ourselves to symmetric probability densities and coding
operations which preserve symmetry of the probability densities
throughout.


The simplest codes are obtained by considering the simplest
departure from Gaussian. Thus, in the notation already used in
this section, let

$$P_\zeta(z) = \left[1 + \frac{\alpha_4}{4!} H_4(z)\right] G(z) \qquad (27)$$

Here in order for the right side to be non-negative, we must have

$$0 \le \alpha_4 \le 4$$

The corresponding differential equation for $g_1(.)$ becomes, with
$y = g_1(x)$ and $x = h_1(y)$,

$$G(y) \left[1 + \frac{\alpha_4}{4!} H_4(y)\right] \frac{dy}{dx} = G(x)$$

which can be solved explicitly as:

$$x = h_1(y) = \mathrm{Erf}^{-1}\left[\mathrm{Erf}\, y + \frac{\alpha_4}{4!} (3y - y^3)\, G(y)\right] \qquad (28)$$

where

$$\mathrm{Erf}\, y = \int_{-\infty}^{y} G(t)\, dt$$

The resulting curve is plotted in Figure 4, using for $\alpha_4$
the largest value possible, namely +4. The function $h_1(y)$ has
an inflection point at $x = y = \sqrt{3}$; $x > y$ for
$0 < y < \sqrt{3}$, $x < y$ for $y > \sqrt{3}$, and $x \to y$ as
$y \to \infty$. x is an odd function of y and can be expanded in
odd powers of y using (28).
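The coder characteristic can be tabulated directly from the closed form x = Erf⁻¹[Erf y + (α₄/4!)(3y − y³)G(y)], where Erf denotes the cumulative unit Gaussian as defined in the text. The sketch below is a minimal illustration: it uses the standard-library normal distribution for G, Erf and its inverse, the function name `h1` simply mirrors the symbol in the text, and the argument should be kept to moderate values (roughly |y| < 6) so that the bracketed quantity stays strictly between 0 and 1 in floating point.

```python
from statistics import NormalDist

_UNIT = NormalDist()  # unit-variance Gaussian: pdf is G, cdf is Erf


def h1(y, alpha4=4.0):
    """Inverse coder characteristic x = h1(y) from the closed-form solution."""
    target = _UNIT.cdf(y) + (alpha4 / 24.0) * (3.0 * y - y ** 3) * _UNIT.pdf(y)
    return _UNIT.inv_cdf(target)


# Tabulate the curve on a coarse grid of y values.
curve = [(0.5 * k, h1(0.5 * k)) for k in range(-6, 7)]
```

The tabulated points show x above y for 0 < y < √3, the crossing at y = √3, and x below y beyond it.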


Since the largest value of $\alpha_4$ corresponds to the largest
departure from Gaussian possible with the representation (27), we
take $\alpha_4 = 4$ for our coder for signal #1. For signal #2, we
must choose $\{\beta_n\}$ [or $g_2(.)$] so that the choice will
lead to the smallest error in the extraction process. In this
simple discussion, we bypass the variational problem involved here
and suggest a heuristic argument. Thus, from the work in Part I,
it would appear that to obtain a small error we have to impart a
departure from Gaussian in the opposite sense, so to speak, to
$\eta_n$, and choose $\beta_n \approx -\alpha_n$ [since for
identical statistics $c_n = 0$ for $n \ge 2$ in (19)]. We do this
by choosing for $g_2(.)$ the functional inverse of $g_1(.)$ so
that

$$g_2(.) = h_1(.)$$

For this transformation, the resulting density of $\eta_n$ can be
readily deduced and is seen to be:

$$P_\eta(z) = \frac{G(z)}{1 + \frac{\alpha_4}{4!} H_4[g_1(z)]} \qquad (29)$$


To a first approximation g,(z) ~ z, so that further 


Oy 
ad ~ E = a nytt 


which serves as an indication of the reasonableness
of the choice. [An exact evaluation of the
coefficients β_n is possible through the identity


(30) 


t2 


Oy 
fli an n)| Gly) Exp it fly) dy = Exp ease 


from which it is also apparent that signal power
is not preserved exactly, but only approximately
so.] This choice of the second coder is an advantage
from a system point of view since the decoders do
not involve a new design.


The probability distributions P;,(z) and 
P,,(z) are plotted in Figures 5 and 6 respectively. 


We next come to the extractor. The form of the 
extractor for the transmitted signal is determined 
either from (15) or (19). If we use (19) we can
obtain P,(z) as in reference 3, noting now that 
all the odd moments vanish because of our symmetry 
assumption. We shall note here the expressions for 
the first three polynomials 


JA72)) Ges 
z 
P.(z)= = 
‘ V2 
[Vee 22 - fy 2 
P.(z) = 


iy) ee 
Ving lity He ~ Ha) 


where 1,, Hy, i, are the moments of Cf, + 7, and 
can be evaluated in terms of the a's and B,'S- 
Thus 


owas 
+ 2: 


(a, + B,) + 30u, - 240 


Hy, = 12 + a, 


Me 


The corresponding coefficients are 


iv) 
ul 
= 


(oe uews,.) 
cs We RET Ne Te 
(Hole = Hy) He 


aba 


The extraction error using only the first 3 poly- 
nomials — or a "3rd order" coder — is thus 


1 anlemeee 
2102 |— 4 — see (32 
2.” Queer) 
Here α_4 = 4; α_6 = 0. β_4 and β_6 can be computed
using (31) and making appropriate correction for
unit power, but we can obtain a rough lower bound
for the error by using the approximate form (30)
and setting β_6 = 0, β_4 = -4 in (32). This yields the
value 1/6.


This error is, however, too large and hence
higher order coders and extractors have to be used
to obtain a smaller error, if the assumed design
philosophy is to be followed. We also note that the
smaller the extraction error is, the closer the
extractor-decoder scheme will be to the optimum, and
the better the system performance.


The decoder being the inverse of the coder pre- 
sents no additional design problem. The overall 
block diagram is shown in Figure 7, where both the 
spectrum shaping filter of the extractor and the 
equalizer of the decoder have been omitted since 
they are mutually reciprocal. 


A similar extractor-decoder scheme can be used 
for the second signal. Here advantage may also be 
taken of the fact that the optimal estimate N? of 
any order is given by 


where s* is the optimal estimate of s, to the same 


order, for possible simplification of the extractor. 


REFERENCES 


1. J. L. Doob, Stochastic Processes, John Wiley and
Sons, New York, 1953.


2. N. Wiener, Extrapolation, Interpolation and
Smoothing of Stationary Time Series, John Wiley
and Sons, New York, 1950.


3. H. Cramér, Mathematical Methods of Statistics,
Princeton University Press, 1946.


4. L. A. Zadeh, A Contribution to the Theory of
Nonlinear Systems, Journal of the Franklin Institute,
255, 387-408, 1953.


Fig. 1 - Optimal non-linear extraction filter.

Fig. 2 - Block schematic of bandwidth-reduction system.

Fig. 3 - Structure of coding filters.

Fig. 5 - Probability density of η_1.

Fig. 6 - Probability density of η_2.

Fig. 7 - Block schematic of coder-extractor.


FINAL-VALUE SYSTEMS WITH GAUSSIAN INPUTS 


by 
Richard C. Booton, Jr. 
Department of Electrical Engineering 


Research Laboratory of Electronics 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


Abstract 


A final-value system controls a response 
variable r(t) over a time interval (0,T) with 
the objective of minimizing the difference 
between a desired value p and the final response 
value r(T). An ensemble of situations is 
considered, and the system input i(t) and the 
desired response p are random variables that are 
statistically related. Physical limitations of 
the element being controlled result in a maximum 
value constraint on the system velocity r'(t). 
Earlier results¹ suggest that a system consisting
of an estimator followed by a "bang-bang"
servo is approximately optimum. The estimator
uses the input to produce an estimate p* of the
desired response and the servo results in a
system velocity as large in magnitude as possible
and with the same sign as the difference p* - r.
The present paper shows that this system is the 
true optimum when the joint distribution of the 
input and the desired response is Gaussian and 
the error criterion is minimization of the 
average of a nondecreasing function of the magni- 
tude of the error. 


Statement of the Problem 


For mathematical convenience, the continuous
time variable is replaced by a discrete set
of n values, separated by an interval h where

h = \frac{T}{n}    (1)

The input is characterized by a sequence of
values i_1, i_2, ..., i_n and the response by a
sequence r_1, r_2, ..., r_n. Each response r_k is
a function only of i_1, i_2, ..., i_k. The
constraint on velocity

|r'(t)| \le V(t)    (2)

is replaced by

|r_k - r_{k-1}| \le h V_k    (3)

The error criterion used is the minimization
of a "cost" which is the average of a
function of the final error. This cost is expressed
in terms of the desired response p and the
final response r_n as

C = E\{ f(p - r_n) \}    (4)


where E denotes the averaging operation 
(expectation). The function f is assumed to be 
an even uniminimal function, in the sense that 


f(x) = f(-x) (5) 
and 
f'(x)> 0 for x >0 (6) 


The cost C can be expressed in terms of the
joint probability density p_0 of i_1, i_2, ..., i_n
and p as

C = \int \cdots \int f(p - r_n)\, p_0(i_1, \ldots, i_n, p)\, di_1 \cdots di_n\, dp

The problem is to minimize the C given by this
expression subject to the constraint (3).


Gaussian Variables 


The density function p_0 is Gaussian, and
hence a set of independent variables can be
introduced to simplify the calculations. With
the conditional mean of p given i_1, ..., i_k
denoted by p_k*, that is

p_k^* = E\{ p \mid i_1, \ldots, i_k \}    (7)

the variables defined by

x_1 = p_1^*
x_2 = p_2^* - p_1^*
\cdots
x_n = p_n^* - p_{n-1}^*
x_{n+1} = p - p_n^*    (8)

are uncorrelated, as shown in Appendix A, and
thus form a set of mutually independent Gaussian
variables. With the probability density function
of x_k denoted by p_k, the cost can be expressed as

C = \int \cdots \int f(x_{n+1} + p_n^* - r_n)\, p_{n+1}(x_{n+1}) \cdots p_1(x_1)\, dx_{n+1} \cdots dx_1    (9)


The Optimum System


The cost expression can be rewritten as

C = \int \cdots \int f_n(p_n^* - r_n)\, p_n(x_n) \cdots p_1(x_1)\, dx_n \cdots dx_1    (10)


* This work was supported in part by the Signal Corps, the Office of Scientific Research (Air Research
and Development Command), and the Office of Naval Research, of the United States.

¹ R. C. Booton, Jr., "Optimum Design of Final-Value Control Systems", Proceedings of the Polytechnic
Institute of Brooklyn Symposium on Nonlinear Network Analysis, 1956.


where

f_n(p_n^* - r_n) = \int f(x_{n+1} + p_n^* - r_n)\, p_{n+1}(x_{n+1})\, dx_{n+1}    (11)

Because f is an even uniminimal function and
p_{n+1} is an even unimaximal function, the result
in Appendix B implies that f_n is an even uniminimal
function. The function r_n(i_1, ..., i_n)
minimizes C if for each set of values (i_1, ..., i_n)
the value of r_n is chosen to minimize f_n. Because
the minimum value of f_n(x) is achieved at x = 0
and f_n is nondecreasing away from this value,
the optimum solution is to set r_n = p_n* if the
constraint allows and otherwise to set p_n* - r_n
as close to zero as allowed by the constraint.
Thus, subject to the constraint

r_{n-1} - hV_n \le r_n \le r_{n-1} + hV_n    (12)

the value of r_n that minimizes f_n is

r_n = r_{n-1} + L_n(p_n^* - r_{n-1})    (13)

where L_n is the limiter function defined by

L_n(x) = -hV_n   for x < -hV_n
       = x       for |x| \le hV_n    (14)
       = hV_n    for x > hV_n

With this choice of r_n, f_n is a function of
p_n* - r_{n-1}. With this function denoted by g_n,

f_n(p_n^* - r_n) = f_n[ p_n^* - r_{n-1} - L_n(p_n^* - r_{n-1}) ]
                 = g_n(p_n^* - r_{n-1})    (15)

This can be expressed as

f_n(p_n^* - r_n) = g_n(x_n + p_{n-1}^* - r_{n-1})    (16)


Substitution of this relation into the cost
expression (10) gives

C = \int \cdots \int g_n(x_n + p_{n-1}^* - r_{n-1})\, p_n(x_n) \cdots p_1(x_1)\, dx_n \cdots dx_1    (17)

which can be written as

C = \int \cdots \int f_{n-1}(p_{n-1}^* - r_{n-1})\, p_{n-1}(x_{n-1}) \cdots p_1(x_1)\, dx_{n-1} \cdots dx_1    (18)

where

f_{n-1}(p_{n-1}^* - r_{n-1}) = \int g_n(x_n + p_{n-1}^* - r_{n-1})\, p_n(x_n)\, dx_n    (19)

The function g_n is an even uniminimal function
and hence, by the result of Appendix B, the
function f_{n-1} also is an even uniminimal
function.

The reasoning used to determine r_n as (13)
can be repeated to determine a similar relation
for r_{n-1}, and the entire procedure can be
repeated so that, by induction, each response
value r_k is given by an expression

r_k = r_{k-1} + L_k(p_k^* - r_{k-1})    (20)

where

L_k(x) = -hV_k   for x < -hV_k
       = x       for |x| \le hV_k    (21)
       = hV_k    for x > hV_k

The expression (20) can be written in the form

\frac{r_k - r_{k-1}}{h} = \frac{1}{h} L_k(p_k^* - r_{k-1})    (22)

In the limit as h approaches zero, this becomes

\frac{d}{dt} r(t) = V(t)\, \mathrm{sgn}[ p^*(t) - r(t) ]    (23)

where

sgn x = -1   if x < 0
      = 0    if x = 0    (24)
      = 1    if x > 0

This is the principal result of the paper.
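

The discrete rule behind this result, r_k = r_{k-1} + L_k(p_k* - r_{k-1}) with the limiter clipping to ±hV_k, can be sketched as follows. This is only an illustration under assumed data: the function names, the constant estimate p* = 1, the bound V = 2 and the step h = 0.1 are hypothetical and do not come from the paper.

```python
def limiter(x: float, bound: float) -> float:
    """Limiter L_k: pass x unchanged inside [-bound, bound], clip it outside."""
    return max(-bound, min(bound, x))


def final_value_response(estimates, v_max, h, r0=0.0):
    """Apply r_k = r_{k-1} + L_k(p_k* - r_{k-1}) with bound h * V_k at each step."""
    r = r0
    trajectory = []
    for p_star, v in zip(estimates, v_max):
        r = r + limiter(p_star - r, h * v)
        trajectory.append(r)
    return trajectory


# Hypothetical data: constant estimate p* = 1, velocity bound V = 2, step h = 0.1.
n = 8
traj = final_value_response([1.0] * n, [2.0] * n, h=0.1)
print(traj)
```

With these numbers the response moves toward the estimate at the maximum allowed rate and then holds it, which is the discrete counterpart of the bang-bang law (23).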


Appendix A 


The purpose of this appendix is to demonstrate
that the variables x_k defined by (8) are
uncorrelated. Each conditional mean p_k* is the
best mean-square estimate of p given i_1, ..., i_k,
and the general properties of mean-square estimation
imply that

E\{ (p - p_k^*) A_k \} = 0    (A-1)

where A_k is any function of i_1, ..., i_k. In
particular,

E\{ (p - p_n^*) x_k \} = 0   for k = 1, ..., n    (A-2)

because each x_k is a function of the input
values.

Use of

E\{ (p - p_n^*)(p_n^* - p_{n-1}^*) \} = 0    (A-3)

shows that

E\{ (p - p_{n-1}^*)^2 \} = E\{ (p - p_n^*)^2 \} + E\{ (p_n^* - p_{n-1}^*)^2 \}    (A-4)

This expression and the fact that p_{n-1}* is the
best estimate of p imply that p_{n-1}* is the best
estimate of p_n* and

E\{ (p_n^* - p_{n-1}^*) x_k \} = 0   for k = 1, ..., n-1    (A-5)

By induction, each p_k*, for k = 1, ..., n-1, can
be shown to be the best estimate of p_{k+1}* and
thus

E\{ (p_{k+1}^* - p_k^*) A_k \} = 0   for k = 1, ..., n-1    (A-6)

The equations (A-2) and (A-6) are equivalent to

E\{ x_j x_k \} = 0   for all j \ne k;  j, k = 1, ..., n+1    (A-7)


Appendix B 


The purpose of this appendix is to show
that the function H defined by

H(y) = \int_{-\infty}^{\infty} F(x + y) G(x)\, dx    (B-1)

where

F(x) = F(-x)
F'(x) \ge 0   for x \ge 0    (B-2)

and

G(x) = G(-x)
G'(x) \le 0   for x \ge 0    (B-3)

is even and

H'(y) \ge 0   for y \ge 0    (B-4)

By a change of variable, (B-1) can be written as

H(y) = \int_{-\infty}^{\infty} F(-x + y) G(-x)\, dx    (B-5)

Because F and G are even

H(y) = \int_{-\infty}^{\infty} F(x - y) G(x)\, dx    (B-6)

and thus

H(y) = H(-y)    (B-7)

Differentiation of (B-1) yields

H'(y) = \int_{-\infty}^{\infty} F'(x + y) G(x)\, dx    (B-8)

By a change of variable,

H'(y) = \int_{-\infty}^{\infty} F'(x) G(x - y)\, dx    (B-9)

which can be rewritten as

H'(y) = \int_{0}^{\infty} F'(x) G(x - y)\, dx + \int_{-\infty}^{0} F'(x) G(x - y)\, dx    (B-10)

A change of variable in the second integral gives

H'(y) = \int_{0}^{\infty} F'(x) G(x - y)\, dx + \int_{0}^{\infty} F'(-x) G(-x - y)\, dx    (B-11)

Because F' is odd and G is even

H'(y) = \int_{0}^{\infty} F'(x) [ G(x - y) - G(x + y) ]\, dx    (B-12)

For y \ge 0 and x \ge 0

|x - y| \le x + y    (B-13)

and

G(x - y) \ge G(x + y)    (B-14)

This inequality and (B-2) imply that the
integral in (B-12) is non-negative and thus
(B-4) holds.
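

A quick numerical illustration of this appendix is given below. It is a sketch only: the choices F(x) = |x| and a unit Gaussian G, the Riemann-sum grid, and the helper name `H` are assumptions made for the example, not part of the proof.

```python
import math


def H(y, F, G, half_width=10.0, step=0.01):
    """Riemann-sum approximation of H(y) = integral of F(x + y) G(x) dx."""
    n = int(round(half_width / step))
    return sum(F(k * step + y) * G(k * step) for k in range(-n, n + 1)) * step


def F(x):
    """Even, and nondecreasing for x >= 0, as required of F."""
    return abs(x)


def G(x):
    """Even, and nonincreasing for x >= 0 (unit Gaussian density)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


values = [H(0.25 * i, F, G) for i in range(9)]  # y = 0, 0.25, ..., 2
print(values)
```

For this pair the computed H is symmetric in y and does not decrease as y moves away from zero, as the appendix asserts in general.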


AN EXTENSION OF THE MINIMUM MEAN SQUARE PREDICTION
THEORY FOR SAMPLED INPUT SIGNALS


Marvin Blum 
CONVAIR 
A Division of General Dynamics Corporation 
San Diego, California 


Abstract 


A method is developed for finding the ordinates
of a digital filter which will produce
a general linear operator of the signal S(t)
such that the mean square error of prediction
will be a minimum. The input to the filter
is sampled at intervals Δt. The samples contain
stationary noise N(jΔt), a stationary signal
component, M(jΔt), and a nonrandom signal
component,

P(j\Delta t) = \sum_{k=0}^{n} a_k P_k(j\Delta t)

where the subset of nonrandom functions P_k(t)
are known a priori, but the parameter vector
(a) = (a_0, a_1, ..., a_n) need not be.


The solution is obtained as a matrix equa- 
tion which relates the ordinates of the digital 
filter to the autocorrelation properties of M(t) 
and N(t) and the nature of the prediction oper- 
ation. 


Introduction 


The central problem in prediction lies in 
the fact that when a filter operates on an input 
signal to produce a desired output, there is usu- 
ally random noise superimposed on the message 
which prevents determination of the desired 
output without error. It may be desirable,
therefore, to select a filter which is optimum
in the following sense: The filter will produce 
an output whose ensemble average value is equal 
to the ensemble average value of the desired 
signal, and the mean square difference between 
the actual output of the filter and the desired 
output will be a minimum. 


Norbert Wiener considered this problem in
his classical work on the subject. He considered
the input function to be a continuous random stationary
time series and the noise to be another
continuous random stationary time series, and
derived the solution for an optimum filter in the
form of the Wiener-Hopf integral equation. It
was required that the filter have a semi-infinite
memory, i.e., the filter extended in the time
domain from 0 to ∞. The work of Phillips and
Weiss² extended the theory to include a nonrandom
input signal in the form of a polynomial of
known degree. A further extension was made
by Zadeh and Ragazzini³ in that they derived
the equations for the optimum filter for a more
complicated model. In their model the message
signal consisted of a random stationary signal
plus a polynomial of known degree, while the
noise was random and stationary. In these three
theories the autocorrelation of the noise and stationary
signal are presumed to be known. In the
last two the memory of the filter is finite, i.e.,
the impulsive admittance of the filter is zero
outside some finite interval.


In a recent paper by Lees,⁴ the solution to
the Zadeh-Ragazzini prediction model for sampled
data was presented. The solution is obtained
in the form of a weighting function which
is piecewise analytical. The output of this filter
is continuous as opposed to the output at sampled
points only, of the digital filter.


However, by an extension of the prediction 
of the digital filter from a fixed parameter con- 
cept to a variable prediction, one can obtain the 
same results as shown by Lees. The extension 
of the digital filter to the piecewise analytic filter 
is presented for the more general input model 
considered in this paper. 


A simplified solution for the piecewise ana- 
lytic filter is presented where the input is a poly- 
nomial of degree n plus white noise, and the de- 
sired output is the predicted value of the kth der- 
ivative of the input. These results are presented 
in the appendix. 


In considering the solution of the prediction
problem for discrete data, the similarity between
the filter problem and the problems of curve fitting
becomes apparent. In a paper by A. C.
Aitken⁵ the problem of the best linear transformation
on a set of observations to fit a general
class of nonrandom functions to the data is considered.
His results are easily interpreted in
terms of a digital linear filter which smooths the
input data. The importance of the approach is
that it admits of a much broader class of nonrandom
inputs. One can use an input (subject to some
general restrictions stated within) which is a linear
function of known functions without knowing
the particular linear relationship.


In this paper the optimum digital linear fil- 
ter is determined for the following conditions: 


(a) The input signal to the digital filter consists
of a stationary component plus a nonrandom
component (from a more general class
of nonstationary components than polynomials)
plus random noise, sampled at equidistant
intervals.


(b) The prediction is a generalized linear op- 
erator of the message components. A parti- 
cular application is made to derivative pre- 
dictions. 


The solution takes these forms: 


(a) A matrix equation is derived which re- 
lates the ordinates of the weighting function 
to the stationary correlation properties of 
both signal and noise, the prediction opera- 
tion, and the values of the n+1 subfunctions 
sampled at the m+1 equidistant points. 


(b) The solution requires a weighting function
with a finite memory, i.e., the discrete impulsive
admittance of the filter is required to
vanish outside an interval 0 ≤ t ≤ mΔt where
Δt is the sampling interval and mΔt is the
finite memory of the filter.


(c) The smoothing operation on the data is in- 
terpreted as a dynamic curve-fitting proce- 
dure which yields minimum variance esti- 
mates as a series of outputs in real time. 
The first output becomes available when the 
first full series of m+1 data points are in 
the filter. 


As succeeding data points become available, an
optimum estimate can be obtained by a linear
weighting of the most recent data point and the m
previous data points. This sliding arc procedure
requires in general a time-varying weighting sequence,
but for a special subclass of the input
function, the weighting sequence is time-invariant,
i.e., the same coefficients in the weighting
sequence multiply the input data at a fixed lag
with respect to the most recent data point, even
though the data changes with changing time.




Definition of Input Model and Weighting Sequence 


Let the input to the filter be

e(t) = S(t) + N(t)    (1)

where

S(t) = P(t) + M(t)    (2)

and P(t) is the nonrandom part of the signal, M(t)
is the stationary part of the signal, and N(t) is
the stationary noise.


Let E(z) be defined as the expected value of z,
or the ensemble average of z.


Then it is assumed that

E^{\dagger}[M(t)] = E[N(t)] = 0    (3)

for all t, and that

P(t) = \sum_{k=0}^{n} a_k P_k(t).    (4)

That is, P(t) can be represented as a linear combination
of the subset of n + 1 functions P_k(t)
where k = 0, 1, 2, ..., n. Knowledge of the
functions P_k(t) is required a priori, but the value
of the parameter vector (a) = (a_0, a_1, a_2, ..., a_n)
need not be known. From equations 1 and 2 the
sample input to the filter at time t = jΔt is
given by

e(j\Delta t) = N(j\Delta t) + M(j\Delta t) + P(j\Delta t).    (5)
Let W'(t) be the impulsive admittance of the filter.
Then the output of the digital filter at time
t = (m+u)Δt (where u = 0, 1, 2, ...) is given by

e^*[(m+u)\Delta t] = \sum_{j=0}^{\infty} \Delta t\, W'(j\Delta t)\, e[(m+u-j)\Delta t].    (6)

Let W_j = Δt W'(jΔt), then

e[(m+u-j)\Delta t] = e_{m+u-j} = N_{m+u-j} + M_{m+u-j} + P(m+u-j),

and

e^*[(m+u)\Delta t] = e^*_{m+u}.

Let us assume that m > n and that W_j = 0 wherever
j < 0 or j > m. Then equation 6 can be written as

e^*_{m+u} = \sum_{j=0}^{m} W_{j,u} \left[ N_{m+u-j} + M_{m+u-j} + P(m+u-j) \right].    (7)

The vector (W_u) = [W_{0,u}, W_{1,u}, ..., W_{m,u}]
will be noted as the weighting sequence.

† In this case E[M(t)] refers to the ensemble
average of M at time t.


In the subsequent solution for the weighting 
sequence the parameter u will be taken equal to 
zero and deleted from the notation. Where ne- 
cessary certain matrixes will be defined for u 
not equal to zero. The solution for u equal to 
zero will be presented and then the modification 
of the solution for u not equal to zero will be 
given. 


Generalized Desired Output 


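Let S*(mΔt) be the desired output. Since the
desired output is defined as a fixed linear operation
of the input one may relate the impulsive
response of an ideal predictor, k'(t), to the input
and output relationships, by an equation of
the form

S^*(t) = \int_{-\infty}^{+\infty} k'(\tau)\, S(t-\tau)\, d\tau.    (8)

For t = mΔt, using equations 2 and 8, one obtains

S^*(m\Delta t) = \int_{-\infty}^{+\infty} k'(\tau)\, P(m\Delta t-\tau)\, d\tau + \int_{-\infty}^{+\infty} k'(\tau)\, M(m\Delta t-\tau)\, d\tau.    (9)

Using equation 4 one obtains

S^*(m\Delta t) = \sum_{k=0}^{n} a_k \int_{-\infty}^{+\infty} k'(\tau)\, P_k(m\Delta t-\tau)\, d\tau + \int_{-\infty}^{+\infty} k'(\tau)\, M(m\Delta t-\tau)\, d\tau.    (10)

Let

Q_k = \int_{-\infty}^{+\infty} k'(\tau)\, P_k(m\Delta t-\tau)\, d\tau    (11)

and

Q_{k,u} = \int_{-\infty}^{+\infty} k'(\tau)\, P_k[(m+u)\Delta t-\tau]\, d\tau.    (12)

The matrixes |Q| and |Q_u| are defined as

|Q| = \begin{bmatrix} Q_0 \\ Q_1 \\ \vdots \\ Q_n \end{bmatrix}, \qquad |Q_u| = \begin{bmatrix} Q_{0,u} \\ Q_{1,u} \\ \vdots \\ Q_{n,u} \end{bmatrix}.    (13)

Let

M^{\circ} = \int_{-\infty}^{+\infty} k'(\tau)\, M(m\Delta t-\tau)\, d\tau.    (14)

Then

S^*(m\Delta t) = (a)\,|Q| + M^{\circ}    (15)

where (a)† is the row matrix (a) = (a_0, a_1, a_2, ..., a_n).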

As an example, consider the relationship

S^*(m\Delta t) = \frac{d}{dt} S(t) \Big|_{t = (m+\alpha)\Delta t}.    (16)

Equation 16 reads: the desired output in real
time at t = mΔt is the value of the first derivative
of the input function evaluated at t = (m+α)Δt.
Equation 11 may be written for equation 16 as

Q_k = \frac{d}{dt} P_k(t) \Big|_{t = (m+\alpha)\Delta t}    (17)

and equation 14 becomes

M^{\circ} = \frac{d}{dt} M(t) \Big|_{t = (m+\alpha)\Delta t}.    (18)


Equations of Constraint on (W) 


Let the error be defined as

\epsilon_m = e^*(m\Delta t) - S^*(m\Delta t)    (19)

where ε_m is the difference between the actual
output e*(mΔt) and the desired output S*(mΔt)
at time t = mΔt. One seeks the optimum weighting
function (W) which is inferred from the following
conditions on ε_m:

E(\epsilon_m) = 0    (20a)

E(\epsilon_m^2) = minimum.    (20b)

Equation 19 and requirement 20a imply that

E[e^*(m\Delta t)] = E[S^*(m\Delta t)].    (21)

Using equation 15 one finds

E[S^*(m\Delta t)] = (a)\,|Q|.    (22)

Using equations 3, 7, and 21 leads to

E[e^*(m\Delta t)] = \sum_{j=0}^{m} W_j P(m-j).    (23)

Substituting equation 4 into equation 23 one obtains

E[e^*(m\Delta t)] = \sum_{j=0}^{m} \sum_{k=0}^{n} W_j P_k(m-j)\, a_k.    (24)

Equation 24 can be written in matrix form as
follows:

E[e^*(m\Delta t)] = (a)\, P\, |W|    (25)

where

(a) = (a_0, a_1, ..., a_n),

|W| = (W)', i.e., the transpose of (W),

|W| = \begin{bmatrix} W_0 \\ W_1 \\ \vdots \\ W_m \end{bmatrix},

and

P = \begin{bmatrix} P_0(m) & P_0(m-1) & \cdots & P_0(0) \\ P_1(m) & P_1(m-1) & \cdots & P_1(0) \\ \vdots & & & \vdots \\ P_n(m) & P_n(m-1) & \cdots & P_n(0) \end{bmatrix}

and

P_u = \begin{bmatrix} P_0(m+u) & \cdots & P_0(u) \\ P_1(m+u) & \cdots & P_1(u) \\ \vdots & & \vdots \\ P_n(m+u) & \cdots & P_n(u) \end{bmatrix}.

Note: A prime indicates that the transpose
of a matrix has been taken.


From equations 21, 22, and 25 it follows
that

(a)\,|Q| = (a)\, P\, |W|.    (26)

Since relationship 26 must hold for arbitrary (a),
it follows that

|Q| = P\, |W|.    (27)

Equation 27 represents a set of (n + 1) linear
constraints on the (m + 1) ordinates of the weighting
function (W) required by 20a.

† () indicates a row matrix, while | | indicates
a column matrix.


Evaluation of E(ε_m²)

Let

E(N_j N_k) = \sigma^2 \rho[(j-k)\Delta t]    (28)

be the correlation function of N(t) and

E(M_j M_k) = \gamma^2 r[(j-k)\Delta t]    (29)

be the correlation function of M(t) where

\rho(0) = r(0) = 1    (30)

and let

E(M_u N_v) = 0    (31)

for all u and v. Then from equations 7, 15, 19
and 25

E(\epsilon_m^2) = E\left[ M^{\circ} - \sum_{j=0}^{m} W_j (N_{m-j} + M_{m-j}) \right]^2.    (32)

Let

V_{j+1,k+1} = \gamma^2 r[(j-k)\Delta t] + \sigma^2 \rho[(j-k)\Delta t],
j = 0, 1, 2, ..., m and k = 0, 1, 2, ..., m    (33)

and let V be the (m+1) x (m+1) matrix whose
elements are V_{j+1,k+1}. Note that V is a
symmetric matrix as is its inverse due to the
assumption that the random components are
stationary.


Define

(\beta) = (\beta_0, \beta_1, ..., \beta_m)    (34)

\beta_j = E(M^{\circ} M_{m-j})    (35)

L^2 = E[(M^{\circ})^2].    (36)

Then L is not a function of the weighting sequence.
Substituting 34, 35, and 36 into 32, one obtains

E(\epsilon_m^2) = (W)\, V\, |W| - 2(\beta)\,|W| + L^2.    (37)

When

S^*(m\Delta t) = \frac{d^K}{dt^K} S(t) \Big|_{t = (m+\alpha)\Delta t}    (38)

then

M^{\circ} = \frac{d^K}{dt^K} M(t) \Big|_{t = (m+\alpha)\Delta t}    (39)

and

L^2 = (-1)^K \gamma^2 \frac{d^{2K}}{dt^{2K}} r(t) \Big|_{t = 0}    (40)

and

\beta_j = \gamma^2 \frac{d^K}{dt^K} r(t) \Big|_{t = (j+\alpha)\Delta t}.    (41)


Minimization of E(ε_m²) with respect to (W)


It is required to minimize E(ε_m²) = (W) V |W| - 2(β)|W| + L²
with respect to (W), subject
to the constraints |Q| = P |W|.


Let

g = (W)\, V\, |W| - 2(\beta)\,|W| + 2\left[ (Q) - (W) P' \right] |\lambda|    (42)

where 2|λ| = 2[λ_0, λ_1, ..., λ_n]' is a column matrix of (n+1)
Lagrangian multipliers. Then the minimum
E(ε_m²) is given by

\frac{\partial g}{\partial (W)} = 0 = 2V|W| - 2|\beta| - 2P'|\lambda|,

so that

V\,|W| = |\beta| + P'\,|\lambda|.    (43)

Using the restraint equations |Q| = P |W| the unknown
Lagrangian multipliers are eliminated and
the equations solved for |W| giving

|W| = V^{-1}|\beta| + V^{-1} P' [P V^{-1} P']^{-1} |Q| - V^{-1} P' [P V^{-1} P']^{-1} P V^{-1} |\beta|    (44)

where the superscript -1 indicates the inverse of
the matrix. It is assumed that the matrices V
and [P V^{-1} P'] are nonsingular so that their
inverses are uniquely defined.
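

The constrained minimization can be sketched numerically by solving the stationarity condition V|W| - P'|λ| = |β| together with the constraint P|W| = |Q| as one linear system, which is equivalent to eliminating the multipliers by hand. The code below is an illustration only: the helper names, the use of exact fractions, and the small test case (m = 2, V = I, M(t) = 0, basis functions 1 and t) are assumptions made for the example.

```python
from fractions import Fraction


def solve(A, b):
    """Gauss-Jordan elimination on exact fractions; A is assumed nonsingular."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [row[n] for row in M]


def optimum_weights(V, P, beta, Q):
    """Solve V|W| - P'|lambda| = |beta| jointly with P|W| = |Q|; return |W|."""
    m1, n1 = len(V), len(P)
    rows = [list(V[j]) + [-P[k][j] for k in range(n1)] for j in range(m1)]
    rows += [list(P[k]) + [0] * n1 for k in range(n1)]
    return solve(rows, list(beta) + list(Q))[:m1]


# Hypothetical case: m = 2, white noise (V = I), M(t) = 0, basis {1, t},
# desired output = value of P(t) at the most recent sample (t = 2).
V = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
P = [[1, 1, 1], [2, 1, 0]]  # columns ordered t = m, m-1, ..., 0
W = optimum_weights(V, P, beta=[0, 0, 0], Q=[1, 2])
print(W)
```

In this test case the weights reduce to the ordinary three-point least-squares end-point weights 5/6, 1/3, -1/6; for a correlated V the same routine still returns weights that satisfy the constraint P|W| = |Q|.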


Interpretation of the Weighting Sequence 
as a Sliding Arc Technique 


The output of the digital filter at time
t = mΔt is given by

e^*_m = \sum_{k=0}^{m} W_k\, e_{m-k}.    (45)

Suppose one requires the output in real time at
t = (m+u)Δt where u = 1, 2, 3, .... Then one
may write

e^*_{m+u} = \sum_{k=0}^{m} W_{k,u}\, e_{m+u-k}    (46)

where the e_{m+u-k} represent the m+1 input values
from time t = (m+u)Δt to uΔt. The weighting
sequences are now interpreted as a set of
sliding weights such that W_{0,u} weights the most
recent data point in time and W_{m,u} weights the
last input which occurs mΔt previous to the most
recent data point. For the general class of input
functions P(t), the weighting sequence W_{k,u}
is time varying, that is, it changes for each
value of u. The optimum weighting sequence is
given by

|W_u| = V^{-1}|\beta| + V^{-1} P_u' [P_u V^{-1} P_u']^{-1} |Q_u| - V^{-1} P_u' [P_u V^{-1} P_u']^{-1} P_u V^{-1} |\beta|.    (47)

Notice that the only changes required are the substitution
of P_u for P and Q_u for Q. The other
matrices do not change since the random components
of the input are assumed to be stationary.


Time Invariant Weighting Sequence 


For a certain class of input functions P(t) the
weighting sequences have the property that W_{k,u} =
W_k, where W_k is given by equation 44, u = 1, 2,
..., and k = 0, 1, 2, ..., m. Let H define this
class of functions. Then H is specified by the
following properties:


a) H is an n + 1 dimensional linear vector
space with basis P_k(t), k = 0, 1, 2, ..., n.


b) A translation in time is linear in H, i.e.,

\sum_{k=0}^{n} a_k P_k(t+\tau) = \sum_{k=0}^{n} a_k'(\tau) P_k(t)   for all \tau.    (48)

Examples of members of this class are arbitrary
polynomials of degree n, exponential
functions of the form exp[(c+jd)t], sums of the
form (a sin ωt + b cos ωt) and products of the
above functions. These combinations can be
summarized by stating that the complete set of
solutions to a homogeneous linear differential
equation with constant coefficients of order n+1
belongs to H.

As an example, if P(t) = a t² then P(t) does
not belong to H since P(t+T) = a(t² + 2Tt + T²) is
of a different form. One may derive an optimum
filter which will predict for the function P(t) = a t²
and which will require only one equation of
constraint. However, the weighting sequence of
the filter will not be time invariant. If one derived
a filter for the more general function P(t) =
a_2 t² + a_1 t + a_0 then the weighting sequence would
be time invariant but would require three equations
of constraint and have a larger mean square error
of prediction. The filter designed to predict for
the general quadratic would also predict correctly
for the function P(t) = a t².


A simple example of the time invariant sliding
arc property of the weighting sequence will be shown.

Let P(t) = a_0 e^{(c+jω)t} = a_0 P_0(t), and the desired
output be

\frac{d}{dt} P(t) \Big|_{t = m\Delta t}, where m = 4.

Let q = e^{(c+jω)Δt}, M(t) = 0, and V = I. Then

P = [q^4, q^3, q^2, q^1, q^0], \qquad P P' = \sum_{j=0}^{4} q^{2j},

|Q| = \frac{d}{dt} P_0(t) \Big|_{t = 4\Delta t} = (c+j\omega)\, q^4,

|W| = P' [P P']^{-1} |Q|, and

W_k = \frac{(c+j\omega)\, q^4\, q^{4-k}}{\sum_{j=0}^{4} q^{2j}},

so that

E(e^*_4) = a_0 \sum_{k=0}^{4} W_k\, q^{4-k} = a_0 (c+j\omega)\, q^4.

For the output at t = (4+u)Δt,

E(e^*_{4+u}) = a_0 \sum_{k=0}^{4} W_k\, q^{4+u-k} = a_0 (c+j\omega)\, q^4\, \frac{q^4 q^{4+u} + q^3 q^{3+u} + \cdots + q^0 q^{u}}{\sum_{j=0}^{4} q^{2j}},

E(e^*_{4+u}) = a_0 (c+j\omega)\, q^{4+u},

which is the derivative
at the end of the prediction interval whose most
recent point is (4+u)Δt.
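

The time-invariance property can be illustrated numerically for a single basis function with V = I and M(t) = 0, where the weights reduce to W_u = P_u'[P_u P_u']^{-1} Q_u. The sketch below is hedged accordingly: it uses a real exponent in place of c + jω, takes Δt = 1 and m = 4, and contrasts the exponential with the single function t², which the text states lies outside the class H. All names and numerical values are hypothetical.

```python
import math


def weights(basis, dbasis, m, u, dt=1.0):
    """Single-function case of W_u = P_u'[P_u P_u']^-1 Q_u with V = I, M(t) = 0."""
    p = [basis((m + u - k) * dt) for k in range(m + 1)]  # P_u, most recent sample first
    q = dbasis((m + u) * dt)                              # Q_u: derivative at t = (m+u)dt
    energy = sum(x * x for x in p)                        # P_u P_u' (a scalar here)
    return [x * q / energy for x in p]


c = -0.3  # hypothetical real exponent, used in place of c + j*omega


def expo(t):
    return math.exp(c * t)


def d_expo(t):
    return c * math.exp(c * t)


exp_w0 = weights(expo, d_expo, m=4, u=0)
exp_w3 = weights(expo, d_expo, m=4, u=3)

sq_w0 = weights(lambda t: t * t, lambda t: 2.0 * t, m=4, u=0)
sq_w3 = weights(lambda t: t * t, lambda t: 2.0 * t, m=4, u=3)

print(exp_w0, exp_w3)
print(sq_w0, sq_w3)
```

For the exponential the two weighting sequences coincide to rounding error, while for t² alone they change with u, which matches the distinction drawn between inputs inside and outside H.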


Special Cases 


Let M(t) = 0; then β_j = 0 for all j and
equation 47 becomes

|W_u| = V^{-1} P_u' [P_u V^{-1} P_u']^{-1} |Q_u|.    (49)

Further, if the noise is uncorrelated so that

\rho[(j-k)\Delta t] = \delta_{jk}    (50)

then V = σ² I, where I is the identity matrix.
Then

|W_u| = P_u' [P_u P_u']^{-1} |Q_u|.    (51)

Finally, if the functions P_k(t) are orthogonal,
i.e., satisfy the condition

\sum_{j=0}^{m} P_k(j+u)\, P_i(j+u) = \delta_{ki}    (52)

where δ_{ki} = 1 for k = i and δ_{ki} = 0 for k ≠ i,
then P_u P_u' = I and equation 47 becomes

|W_u| = P_u'\, |Q_u|.    (53)


Modified Wiener Model


Let P(t) = 0 and S(t) = M(t); then

e^*(m\Delta t) = \sum_{j=0}^{m} W_j \left[ M_{m-j} + N_{m-j} \right],

S^*(m\Delta t) = \int_{-\infty}^{+\infty} k'(\tau)\, M(m\Delta t-\tau)\, d\tau, and

\epsilon_m = e^*(m\Delta t) - S^*(m\Delta t), \qquad E(\epsilon_m) = 0.

Equation 37 gives the value of E(ε_m²):

\frac{\partial E(\epsilon_m^2)}{\partial (W)} = 0 = 2V|W| - 2|\beta|, \qquad |W| = V^{-1} |\beta|.


Conclusion


A general relationship for the optimum
weighting sequence has been derived such that
the mean square error of prediction is a
minimum. The model for the input signal contains
a nonstationary component which is an arbitrary
linear combination of n+1 known functions plus a
stationary random component and stationary noise.
The condition for which the weighting function can
be utilized as a time invariant sliding arc has been
stated and an example of this technique applied.


Appendix I: Extension of the Digital Filter to a
Continuous, Piecewise Analytic Filter


Let the input to the digital filter consist of a
polynomial of degree n and white noise. Then

M(t) = 0    (1a)

and

\rho[(j-k)\Delta t] = \delta_{jk}.    (2a)

The input polynomial is defined by

P(j\Delta t) = \sum_{k=0}^{n} a_k P_k(j\Delta t).    (3a)

The P_k(jΔt) are selected to be orthogonal over
the interval j = 0, 1, 2, ..., m; that is, P_k(jΔt)
satisfies the relationship

\sum_{j=0}^{m} P_k(j\Delta t)\, P_i(j\Delta t) = \delta_{ik}.    (4a)

The first few polynomials are given by the
relation

P_k(j\Delta t) = \frac{\epsilon_k(j\Delta t)}{[S(k, m+1)]^{1/2}}    (5a)

where

\epsilon_0(j\Delta t) = 1

\epsilon_1(j\Delta t) = \Delta t \left( (j+1) - \frac{m+2}{2} \right)    (6a)

\epsilon_2(j\Delta t) = \epsilon_1^2(j\Delta t) - (\Delta t)^2\, \frac{(m+1)^2 - 1}{12}

and S(k, m+1) = \sum_{j=0}^{m} \epsilon_k^2(j\Delta t).

Higher order polynomials can be obtained from
reference 6.

Let the desired output be

S^*(m\Delta t) = \frac{d^L P(t)}{dt^L} \Big|_{t = (m+\alpha)\Delta t}.    (7a)

Then

Q_k = \frac{d^L P_k(t)}{dt^L} \Big|_{t = (m+\alpha)\Delta t} = P_k^{(L)}[(m+\alpha)\Delta t].    (8a)

Then the weighting sequence by equation 53 is
(for u equal to zero)

|W| = P'\, |Q|    (9a)

so that

W_v = \sum_{k=L}^{n} Q_k\, P_k[(m-v)\Delta t],    (10a)

where v = 0, 1, 2, ..., m. Thus for n = 1 and
L = 0 one has a linear input and an m + 1 point
least squares curve fit. The output is the predicted
value of the input where αΔt is the prediction
interval.


Then

W_v = \frac{1}{m+1} + \frac{\epsilon_1[(m-v)\Delta t]\; \epsilon_1[(m+\alpha)\Delta t]}{S(1, m+1)},    (11a)

and using equation 6a one has

W_v = \frac{1}{m+1} + \frac{\left[ (m+1-v) - \frac{m+2}{2} \right] \left[ (m+1+\alpha) - \frac{m+2}{2} \right] (\Delta t)^2}{S(1, m+1)}.    (12a)

If now the variable α+Δα is substituted for α
where 0 ≤ Δα ≤ 1, then W_v is a continuous
function of Δα. This substitution is equivalent
to a continuous minimum variance prediction
over the interval between the samples based on
the previous knowledge of the last m + 1 sample
points. When an additional data point is
sampled, then the constant corresponding to
Δα = 0 is used. That is

W_v(\Delta\alpha) = W_v(\Delta\alpha + \lambda)    (13a)

where λ = 0, 1, 2, .... The W_v are cyclic
with a period of unity.


Upon making the above substitution in
equation 12a, one has

W_v = \frac{1}{m+1} + \frac{\left[ \frac{m}{2} - v \right] \left[ \frac{m}{2} + \alpha + \Delta\alpha \right]}{S(1, m+1)}.    (14a)

(Note: Δt is taken equal to unity.)

For m = 2,

W_v = \frac{1}{3} + \frac{\epsilon_1(2-v)\; \epsilon_1(2+\alpha+\Delta\alpha)}{S(1, 3)}.    (15a)

Since ε_1(0) = -1, ε_1(1) = 0, ε_1(2) = 1, then
S(1, 3) = 2, from which one obtains

W_0 = \frac{5}{6} + \frac{\alpha+\Delta\alpha}{2}

W_1 = \frac{1}{3}    (16a)

W_2 = -\frac{1}{6} - \frac{\alpha+\Delta\alpha}{2}


The above solution checks with the functions
u_0(t), u_1(t), and u_2(t) presented by Lees for
this same model.


As a second example consider n = 1, L = 0,
m = 3. Then

W_v = \frac{1}{4} + \frac{\epsilon_1(3-v)\; \epsilon_1(3+\alpha+\Delta\alpha)}{S(1, 4)}.    (17a)

Since ε_1(0) = -3/2, ε_1(1) = -1/2, ε_1(2) = 1/2,
ε_1(3) = 3/2, then S(1, 4) = \sum_{j=0}^{3} \epsilon_1^2(j) = 5. One
evaluates

W_v = \frac{1}{4} + \frac{(3 - 2v)\,[3 + 2(\alpha+\Delta\alpha)]}{20}    (18a)

where v = 0, 1, 2, and 3, and obtains

W_0 = \frac{7 + 3(\alpha+\Delta\alpha)}{10}

W_1 = \frac{4 + (\alpha+\Delta\alpha)}{10}

W_2 = \frac{1 - (\alpha+\Delta\alpha)}{10}    (19a)

W_3 = \frac{-2 - 3(\alpha+\Delta\alpha)}{10}


These results check with the functions obtained
by Lees.
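

The two examples can be cross-checked with an ordinary least-squares line fit, since for n = 1, L = 0 and white noise the weights are W = P'[P P']^{-1}|Q| for any basis spanning the straight lines. The sketch below uses the non-orthogonal basis {1, t} with Δt = 1 and exact fractions; the function name and the sample value of α + Δα are illustrative assumptions.

```python
from fractions import Fraction


def linear_prediction_weights(m, alpha):
    """Least-squares line through m + 1 unit-spaced samples, evaluated at t = m + alpha."""
    alpha = Fraction(alpha)
    times = [Fraction(m - v) for v in range(m + 1)]      # v = 0 is the most recent sample
    s0, s1, s2 = len(times), sum(times), sum(t * t for t in times)
    det = s0 * s2 - s1 * s1                                # determinant of P P' for basis {1, t}
    target = m + alpha
    c0 = (s2 - s1 * target) / det                          # [P P']^-1 |Q| with |Q| = (1, target)'
    c1 = (s0 * target - s1) / det
    return [c0 + c1 * t for t in times]


a = Fraction(1, 4)  # hypothetical value of alpha + delta-alpha
print(linear_prediction_weights(2, a))
print(linear_prediction_weights(3, a))
```

For m = 2 this gives 5/6 + a/2, 1/3, -1/6 - a/2, and for m = 3 it gives (7 + 3a)/10, (4 + a)/10, (1 - a)/10, (-2 - 3a)/10, where a stands for α + Δα.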


Approaching the solution from the point of 
view of the continuous extension of ato atAa, 
certain properties become apparent for the above 
input model as follows: 


a) The mean square error output is smallest
at the times corresponding to the sampling of a
new data point and increases monotonically until
the next data point. At the next data point the
mean square error returns to the previous
value. Thus


$$\sigma^2(\Delta\alpha) = \sigma^2(\Delta\alpha + \lambda) \qquad (20a)$$

for $\lambda = 0, 1, \ldots$ That is, the mean
square error is periodic.
b) The smallest possible value of $\sigma^2$ is
given when
a+ aas- [Pe : (2la 


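c) The values of $\sigma^2$ and $W_\nu(\Delta\alpha)$ are functions
of the variables $\alpha$ and $\Delta\alpha$ in the form $\alpha + \Delta\alpha$ only.


d) For an L-th derivative output, $W_\nu(\alpha + \Delta\alpha)$
is a polynomial of degree $n - L$ ($n \ge L$) in
$(\alpha + \Delta\alpha)$ and a polynomial of degree n in $\nu$.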
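
Properties a) through d) can be checked numerically in the simplest case. The following is a minimal sketch, not the author's formulation: it assumes a straight-line model (n = 1, L = 0), unit sample spacing, uncorrelated noise of equal variance on every sample, and samples indexed at times 0, 1, ..., m. The indexing and the function names are illustrative assumptions. Under those assumptions the least-squares weights have a simple closed form, and the output noise variance is proportional to the sum of the squared weights.

```python
def line_fit_weights(m, tau):
    """Weights on samples taken at times 0..m (assumed indexing) that give
    the least-squares straight-line estimate of the input at time tau."""
    times = range(m + 1)
    mean_t = m / 2.0
    # Sum of squared deviations of the sample times about their mean.
    s = sum((t - mean_t) ** 2 for t in times)
    return [1.0 / (m + 1) + (t - mean_t) * (tau - mean_t) / s for t in times]


def variance_factor(m, tau):
    """Output noise variance per unit input noise variance (white noise)."""
    return sum(w * w for w in line_fit_weights(m, tau))


if __name__ == "__main__":
    m = 2
    # tau = m is the newest sample; tau > m is prediction beyond it.
    for tau in (1.0, 2.0, 2.5, 3.0):
        print(tau, line_fit_weights(m, tau), variance_factor(m, tau))
```

In this sketch the weights are linear in tau and in the sample index, and the variance factor grows monotonically as tau moves away from the mean sample time m/2, which is the behavior described in a) and b) for this particular model.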


If the analytic extension of $\alpha$ to $\alpha + \Delta\alpha$ is
taken as a solution to the optimum continuous
filter for discrete data, then one may use
equations 44 and 47 as the solution to this
problem. One need only substitute $\alpha + \Delta\alpha$ for $\alpha$
in the Q and B matrices since, if one assumes
stationary noise, the other matrices are not
affected. It is not claimed by the author that
the extension is proved herein. From an
intuitive point of view, however, since one has
no information other than at the sampled
points, it would seem that the best one could
do between samples is to extrapolate in an
optimum manner using the minimum variance
filter.


The fact that this procedure checks out in
the case tested lends qualitative weight to use
of the same principle in the extended model.
For the class of inputs belonging to H, the
weighting function will be time invariant and
the periodic properties of the mean square error
will be the same as previously discussed. For
the more general input function not in H, the
filter will be a time varying function. Thus as
each sample is presented to the filter, a new
set of m + 1 functions $W_\nu(\alpha + \Delta\alpha)$ will be
required, as given by equation 47.




The mean square error will be smallest
at the sampled points and increase monotonically
till the next sample point. At the next
sample point the mean square error changes
discontinuously to the smallest value for the next
interval.


References


Norbert Wiener, "Extrapolation, Interpolation
and Smoothing of Stationary Time Series with
Engineering Applications", Cambridge, Technology
Press of M.I.T., 1949.


The National Military Establishment Research
and Development Board, "Data Smoothing and
Prediction in Fire Control, Appendix B", Report
Series No. 13, MCG 12/1, 15 August 1948.


Zadeh and Ragazzini, "An Extension of
Wiener's Theory of Prediction", Journal of
Applied Physics, Volume 21, July 1950.


Lees, A. S., "Interpolation and Extrapolation
of Sampled Data", IRE Transactions on
Information Theory, Volume IT-2, March 1956.


Aitken, A. C., "On Least Squares and Linear
Combination of Observations", Proceedings
of Royal Society of Edinburgh, Volume 55,
1934-35, pp. 42-48.


Anderson, R. L. and Houseman, E. E.,
"Tables of Orthogonal Polynomial Values
Extended to N = 104", Research Bulletin
297, April 1942, Ames, Iowa (Iowa State
College).


A NEW INTERPRETATION OF INFORMATION RATE 


by 


J. L. KELLY, JR.
Bell Telephone Laboratories, Incorporated 
Murray Hill, New Jersey 


(Reprinted from B.S.T.J., July 1956) 


ABSTRACT 


If the input symbols to a communication 
channel represent the outcomes of a chance event 
on which bets are available at odds consistent
with their probabilities (i.e., "fair" odds), a 
gambler can use the knowledge given him by the 
received symbols to cause his money to grow ex- 
ponentially. The maximum exponential rate of 
growth of the gambler's capital is equal to the 
rate of transmission of information over the 
channel. This result is generalized to include 
the case of arbitrary odds. 


Thus we find a situation in which the trans- 
mission rate has significance even though no cod- 
ing is contemplated. Previously this quantity 
was given significance only by a theorem of 
Shannon's which asserted that, with suitable en- 
coding, binary digits could be transmitted over 
the channel at this rate with an arbitrarily small 
probability of error. 


Introduction 


Shannon defines the rate of transmission 
over a noisy communication channel in terms of 
various probabilities.¹ This definition is given
significance by a theorem which asserts that bi- 
nary digits may be encoded and transmitted over 
the channel at this rate with arbitrarily small 
probability of error. Many workers in the field 
of communication theory have felt a desire to at- 
tach significance to the rate of transmission in 
cases where no coding was contemplated. Some have 
even proceeded on the assumption that such a sig- 
nificance did, in fact, exist. For example, in 
systems where no coding was desirable or even pos- 
sible (such as radar), detectors have been de- 
signed by the criterion of maximum transmission 
rate or, what is the same thing, minimum 
equivocation. Without further analysis such a 
procedure is unjustified. 


The problem then remains of attaching a value 
measure to a communication system in which errors 
are being made at a non-negligible rate, i.e., 
where optimum coding is not being used. In its 
most general formulation this problem seems to 
have but one solution. A cost function must be 
defined on pairs of symbols which tell how bad 
it is to receive a certain symbol when a specified 
signal is transmitted. Furthermore, this cost 
function must be such that its expected value
has significance, i.e., a system must be preferable
to another if its average cost is less.
The utility theory of Von Neumann² shows us one
way to obtain such a cost function. Generally
this cost function would depend on things external
to the system and not on the probabilities which
describe the system, so that its average value
could not be identified with the rate as defined
by Shannon.


The cost function approach is, of course, not
limited to studies of communication systems, but
can actually be used to analyze nearly any branch
of human endeavor. The author believes that it is
too general to shed any light on the specific prob- 
lems of communication theory. The distinguish- 
ing feature of a communication system is that the 
ultimate receiver (thought of here as a person) 
is in a position to profit from any knowledge of 
the input symbols or even from a better estimate 
of their probabilities. A cost function, if it 
is supposed to apply to a communication system, 
must somehow reflect this feature. The point here 
is that an arbitrary combination of a statistical 
transducer (i.e., a channel) and a cost function 
does not necessarily constitute a communication 
system. In fact (not knowing the exact definition
of a communication system on which the above state- 
ments are tacitly based) the author would not 
know how to test such an arbitrary combination to 
see if it were a communication system. 


What can be done, however, is to take some 
real-life situation which seems to possess the 
essential features of a communication problem, 
and to analyze it without the introduction of an 
arbitrary cost function. The situation which will 
be chosen here is one in which a gambler uses 
knowledge of the received symbols of a communica- 
tion channel in order to make profitable bets on 
the transmitted symbols. 


The Gambler With A Private Wire 


Let us consider a communication channel which 
is used to transmit the results of a chance situa- 
tion before those results become common knowledge, 
so that a gambler may still place bets at the ori- 
ginal odds. Consider first the case of a noise- 
less binary channel, which might be used, for ex- 
ample, to transmit the results of a series of base- 
ball games between two equally matched teams. The 
gambler could obtain even money bets even though he 
already knew the result of each game. The amount 
of money he could make would depend only on how 
much he chose to bet. How much would he bet? 
Probably all he had since he would win with
certainty. In this case his capital would grow
exponentially and after n bets he would have $2^n$
times his original bankroll. This exponential
growth of capital is not uncommon in economics. 

In fact, if the binary digits in the above channel
were arriving at the rate of one per week, the
sequence of bets would have the value of an
investment paying 100 per cent interest per week
compounded weekly. We will make use of a quantity G
called the exponential rate of growth of the
gambler's capital, where


$$G = \lim_{n \to \infty} \frac{1}{n} \log \frac{V_n}{V_0},$$


where $V_n$ is the gambler's capital after n bets,
$V_0$ is his starting capital, and the logarithm is
to the base two. In the above example G = 1.


Consider the case now of a noisy binary channel,
where each transmitted symbol has probability
p of error and q of correct transmission. Now
the gambler could still bet his entire capital each
time, and, in fact, this would maximize the expected
value of his capital, $\langle V_n \rangle$, which in this case
would be given by


$$\langle V_n \rangle = (2q)^n V_0.$$


This would be little comfort, however, since when
n was large he would probably be broke and, in
fact, would be broke with probability one if he
continued indefinitely. Let us, instead, assume
that he bets a fraction, $\ell$, of his capital each
time. Then


$$V_n = (1 + \ell)^W (1 - \ell)^L V_0,$$


where W and L are the number of wins and losses in
the n bets. Then

$$G = \lim_{n \to \infty} \left[ \frac{W}{n} \log(1 + \ell) + \frac{L}{n} \log(1 - \ell) \right] = q \log(1 + \ell) + p \log(1 - \ell)$$

with probability one.


Let us maximize G with respect to $\ell$. The maximum
value with respect to the $Y_i$ of a quantity of
the form $Z = \sum X_i \log Y_i$, subject to the constraint
$\sum Y_i = Y$, is obtained by putting

$$Y_i = \frac{Y}{X} X_i,$$

where $X = \sum X_i$. This may be shown directly from
the convexity of the logarithm.


Thus we put

$$(1 + \ell) = 2q$$
$$(1 - \ell) = 2p$$

and

$$G_{\max} = 1 + p \log p + q \log q = R,$$

which is the rate of transmission as defined by
Shannon.
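
The binary result can be checked numerically. The sketch below assumes a binary symmetric channel with probability q of correct transmission (0.5 < q < 1) and even-money odds, as in the text; the function names are illustrative only. It evaluates G for an arbitrary betting fraction and compares the value at the fraction 2q - 1 with 1 + p log p + q log q.

```python
import math


def growth_rate(ell, q):
    """G = q*log2(1 + ell) + p*log2(1 - ell) for betting fraction ell."""
    p = 1.0 - q
    return q * math.log2(1.0 + ell) + p * math.log2(1.0 - ell)


def best_fraction(q):
    """Fraction that maximizes G, from the condition 1 + ell = 2q."""
    return 2.0 * q - 1.0


def transmission_rate(q):
    """R = 1 + p*log2(p) + q*log2(q) for the binary symmetric channel."""
    p = 1.0 - q
    return 1.0 + p * math.log2(p) + q * math.log2(q)


if __name__ == "__main__":
    q = 0.75
    ell = best_fraction(q)
    print("fraction:", ell)
    print("G at that fraction:", growth_rate(ell, q))
    print("R:", transmission_rate(q))
```

Scanning other fractions between 0 and 1 with `growth_rate` gives values no larger than `transmission_rate(q)`, in line with the maximization above.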




One might still argue that the gambler should
bet all his money (make $\ell = 1$) in order to maximize
his expected win after n times. It is surely
true that if the game were to be stopped after n
bets the answer to this question would depend on
the relative values (to the gambler) of being
broke or possessing a fortune. If we compare the
fates of two gamblers, however, playing a
nonterminating game, the one which uses the value $\ell$
found above will, with probability one, eventually
get ahead and stay ahead of one using any other $\ell$.
At any rate, we will assume that the gambler will
always bet so as to maximize G.
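
The comparison of two gamblers can be illustrated by simulation. This is only a sketch under stated assumptions: a binary channel with q = 0.75, even-money odds, a pseudo-random sequence of outcomes from Python's `random` module, and two fixed fractions chosen for illustration (0.5, which satisfies 1 + fraction = 2q, and 0.9). Capital is tracked as a base-two logarithm so that long runs do not underflow.

```python
import math
import random


def simulate_log_capital(ell, q, n, seed):
    """Return log2(V_n / V_0) after n bets at fixed fraction ell.

    Using the same seed for two gamblers makes them face the same
    sequence of wins and losses.
    """
    rng = random.Random(seed)
    log_capital = 0.0
    for _ in range(n):
        if rng.random() < q:          # received symbol was correct: win
            log_capital += math.log2(1.0 + ell)
        else:                         # received symbol was wrong: loss
            log_capital += math.log2(1.0 - ell)
    return log_capital


if __name__ == "__main__":
    n = 10000
    first = simulate_log_capital(0.5, 0.75, n, seed=1)
    second = simulate_log_capital(0.9, 0.75, n, seed=1)
    print("log2 capital, fraction 0.5:", first, "per bet:", first / n)
    print("log2 capital, fraction 0.9:", second, "per bet:", second / n)
```

A single simulated run is of course not a proof of the probability-one statement; it only shows the per-bet growth of the first gambler settling near the computed maximum while the second falls behind.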


Let us now consider the case in which the 
channel has several input symbols, not necessarily 
equally likely, which represent the outcome of 
chance events. We will use the following notation: 

p(s)      the probability that the transmitted
          symbol is the s'th one.

p(r/s)    the conditional probability that the
          received symbol is the r'th on the
          hypothesis that the transmitted symbol is
          the s'th one.

p(s,r)    the joint probability of the s'th
          transmitted and r'th received symbol.

q(r)      received symbol probability.

q(s/r)    conditional probability of transmitted
          symbol on hypothesis of received symbol.

$\alpha_s$      the odds paid on the occurrence of the
          s'th transmitted symbol, i.e., $\alpha_s$ is the
          number of dollars returned for a one-dollar
          bet (including that one dollar).

a(s/r)    the fraction of the gambler's capital
          that he decides to bet on the occurrence
          of the s'th transmitted symbol after
          observing the r'th received symbol.


Only the case of independent transmitted symbols
and noise will be considered. We will consider
first the case of "fair" odds, i.e.,

$$\alpha_s = \frac{1}{p(s)}.$$

In any sort of parimutuel betting there is a
tendency for the odds to be fair (ignoring the "track
take"). To see this first note that if there is
no "track take"

$$\sum_s \frac{1}{\alpha_s} = 1,$$

since all the money collected is paid out to the
winner. Next note that if

$$\alpha_s > \frac{1}{p(s)}$$

for some s a bettor could insure a profit by making
repeated bets on the s'th outcome. The extra
betting which would result would lower $\alpha_s$. The
same feedback mechanism probably takes place in
more complicated betting situations, such as stock
market speculation.


There is no loss in generality in assuming
that

$$\sum_s a(s/r) = 1,$$

i.e., the gambler bets his total capital regardless
of the received symbol. Since

$$\sum_s \frac{1}{\alpha_s} = 1,$$

he can effectively hold back money by placing
canceling bets. Now

$$V_n = \prod_{r,s} \left[ a(s/r)\, \alpha_s \right]^{W_{sr}} V_0,$$

where $W_{sr}$ is the number of times that the
transmitted symbol is s and the received symbol is r.
Then

$$\log \frac{V_n}{V_0} = \sum_{r,s} W_{sr} \log \alpha_s a(s/r),$$

$$G = \lim_{n \to \infty} \frac{1}{n} \log \frac{V_n}{V_0} = \sum_{r,s} p(s,r) \log \alpha_s a(s/r) \qquad (1)$$

with probability one. Since

$$\alpha_s = \frac{1}{p(s)}$$

here,

$$G = \sum_{r,s} p(s,r) \log \frac{a(s/r)}{p(s)} = \sum_{r,s} p(s,r) \log a(s/r) + H(X),$$

where H(X) is the source rate as defined by
Shannon. The first term is maximized by putting

$$a(s/r) = \frac{p(s,r)}{\sum_k p(k,r)} = \frac{p(s,r)}{q(r)} = q(s/r).$$

Then $G_{\max} = H(X) - H(X/Y)$, which is the rate of
transmission defined by Shannon.
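
The fair-odds result can be verified on a small example. The sketch below assumes a hypothetical two-symbol joint distribution p(s,r) (the numbers are made up for illustration), sets the odds to 1/p(s), and evaluates expression (1) for bets equal to the conditional probabilities q(s/r). The function names are assumptions of this sketch.

```python
import math


def growth_rate(joint, alpha, bets):
    """G = sum over s, r of p(s,r) * log2(alpha_s * a(s/r)).

    joint[s][r] is p(s,r); bets[s][r] is a(s/r).
    """
    total = 0.0
    for s, row in enumerate(joint):
        for r, p_sr in enumerate(row):
            if p_sr > 0.0:
                total += p_sr * math.log2(alpha[s] * bets[s][r])
    return total


def posterior(joint):
    """q(s/r) = p(s,r) / q(r); assumes every received symbol has q(r) > 0."""
    n_r = len(joint[0])
    q_r = [sum(row[r] for row in joint) for r in range(n_r)]
    return [[row[r] / q_r[r] for r in range(n_r)] for row in joint]


def transmission_rate(joint):
    """H(X) - H(X/Y), computed as sum p(s,r) log2[p(s,r) / (p(s) q(r))]."""
    p_s = [sum(row) for row in joint]
    q_r = [sum(row[r] for row in joint) for r in range(len(joint[0]))]
    return sum(
        p * math.log2(p / (p_s[s] * q_r[r]))
        for s, row in enumerate(joint)
        for r, p in enumerate(row)
        if p > 0.0
    )


if __name__ == "__main__":
    joint = [[0.3, 0.1], [0.2, 0.4]]            # hypothetical p(s,r)
    fair_odds = [1.0 / sum(row) for row in joint]
    print("G with a(s/r) = q(s/r):", growth_rate(joint, fair_odds, posterior(joint)))
    print("rate of transmission:  ", transmission_rate(joint))
```

Any other choice of a(s/r) whose columns sum to one, for instance betting half the capital on each symbol regardless of r, gives a smaller value of `growth_rate` on this example.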


When the Odds Are Not Fair


Consider the case where there is no track
take, i.e.,

$$\sum_s \frac{1}{\alpha_s} = 1,$$

but where $\alpha_s$ is not necessarily $\frac{1}{p(s)}$.




It is still permissible to set $\sum_s a(s/r) = 1$ since
the gambler can effectively hold back any amount
of money by betting it in proportion to the $1/\alpha_s$.
Equation (1) now can be written

$$G = \sum_{r,s} p(s,r) \log a(s/r) + \sum_s p(s) \log \alpha_s.$$

G is still maximized by placing a(s/r) = q(s/r)
and

$$G_{\max} = -H(X/Y) + \sum_s p(s) \log \alpha_s = H(\alpha) - H(X/Y),$$

where

$$H(\alpha) = \sum_s p(s) \log \alpha_s.$$


Several interesting facts emerge here.

(a) In this case G is maximized as before by
putting a(s/r) = q(s/r). That is, the gambler
ignores the posted odds in placing his bets!

(b) Since the minimum value of $H(\alpha)$ subject
to

$$\sum_s \frac{1}{\alpha_s} = 1$$

obtains when

$$\alpha_s = \frac{1}{p(s)}$$

and $H(X) = H(\alpha)$, any deviation from fair odds helps
the gambler.

(c) Since the gambler's exponential gain would
be $H(\alpha) - H(X)$ if he had no inside information,
we can interpret $R = H(X) - H(X/Y)$ as the
increase of $G_{\max}$ due to the communication
channel. When there is no channel, i.e., H(X/Y)
= H(X), $G_{\max}$ is minimized (at zero) by
setting $\alpha_s = \frac{1}{p(s)}$.

This gives further meaning to the concept "fair
odds."
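
Facts (b) and (c) can be illustrated with a short computation. The sketch below assumes a hypothetical two-outcome source and several sets of odds whose reciprocals sum to one (no track take); all numbers and function names are illustrative. It evaluates the exponential gain available without any channel, which should vanish at fair odds and be positive otherwise.

```python
import math


def source_entropy(p):
    """H(X) = -sum p(s) log2 p(s)."""
    return -sum(ps * math.log2(ps) for ps in p if ps > 0.0)


def h_alpha(p, alpha):
    """The quantity sum p(s) log2(alpha_s) for posted odds alpha."""
    return sum(ps * math.log2(a) for ps, a in zip(p, alpha))


def gain_without_channel(p, alpha):
    """Exponential gain with no inside information: sum p log2(alpha) - H(X)."""
    return h_alpha(p, alpha) - source_entropy(p)


if __name__ == "__main__":
    p = [0.4, 0.6]                               # hypothetical p(s)
    for alpha in ([1 / 0.4, 1 / 0.6], [2.0, 2.0], [4.0, 4.0 / 3.0]):
        print(alpha, gain_without_channel(p, alpha))
```

With a channel present, the maximum gain in this no-track-take case would be the value printed here plus the rate of transmission R.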


When There Is a "Track Take"


In the case there is a "track take" the situation
is more complicated. It can no longer be assumed
that $\sum_s a(s/r) = 1$. The gambler cannot make
canceling bets since he loses a percentage to the
track. Let $b_r = 1 - \sum_s a(s/r)$, i.e., the fraction
not bet when the received symbol is the r'th one.
Then the quantity to be maximized is

$$G = \sum_{r,s} p(s,r) \log \left[ b_r + \alpha_s a(s/r) \right], \qquad (2)$$

subject to the constraints

$$b_r + \sum_s a(s/r) = 1.$$

In maximizing (2) it is sufficient to maximize
the terms involving a particular value of r and to
do this separately for each value of r since both
in (2) and in the associated constraints, terms
involving different r's are independent. That is,
we must maximize terms of the type

$$G_r = q(r) \sum_s q(s/r) \log \left[ b_r + \alpha_s a(s/r) \right]$$

subject to the constraint

$$b_r + \sum_s a(s/r) = 1.$$

Actually, each of these terms is the same form
as that of the gambler's exponential gain where
there is no channel

$$G = \sum_s p(s) \log \left[ b + \alpha_s a(s) \right]. \qquad (3)$$

We will maximize (3) and interpret the results
either as a typical term in the general problem or
as the total exponential gain in the case of no
communication channel. Let us designate by $\lambda$ the
set of indices, s, for which a(s) > 0, and by $\lambda'$
the set for which a(s) = 0. Now at the desired
maximum

$$\frac{\partial G}{\partial a(s)} = \frac{p(s)\alpha_s}{b + a(s)\alpha_s} \log e = k \quad \text{for } s \in \lambda,$$

$$\frac{\partial G}{\partial b} = \sum_s \frac{p(s)}{b + a(s)\alpha_s} \log e = k,$$

$$\frac{\partial G}{\partial a(s)} = \frac{p(s)\alpha_s}{b} \log e \le k \quad \text{for } s \in \lambda',$$

where k is a constant. The equations yield

$$k = \log e, \qquad b = \frac{1 - p}{1 - \sigma},$$

$$a(s) = p(s) - \frac{b}{\alpha_s} \quad \text{for } s \in \lambda,$$

where $p = \sum_\lambda p(s)$, $\sigma = \sum_\lambda (1/\alpha_s)$, and the inequalities
yield

$$p(s)\alpha_s \le b = \frac{1 - p}{1 - \sigma} \quad \text{for } s \in \lambda'.$$

We will see that the conditions

$$\sigma < 1,$$

$$p(s)\alpha_s > \frac{1 - p}{1 - \sigma} \quad \text{for } s \in \lambda,$$

$$p(s)\alpha_s \le \frac{1 - p}{1 - \sigma} \quad \text{for } s \in \lambda'$$

completely determine $\lambda$.

If we permute indices so that

$$p(s)\alpha_s \ge p(s+1)\alpha_{s+1},$$

then $\lambda$ must consist of all $s \le t$ where t is a positive
integer or zero. Consider how the fraction

$$F_t = \frac{1 - p_t}{1 - \sigma_t}$$

varies with t, where

$$p_t = \sum_{s=1}^{t} p(s), \qquad \sigma_t = \sum_{s=1}^{t} \frac{1}{\alpha_s}, \qquad F_0 = 1.$$

Now if $p(1)\alpha_1 < 1$, $F_t$ increases with t until $\sigma_t \ge 1$.
In this case t = 0 satisfies the desired conditions
and $\lambda$ is empty. If $p(1)\alpha_1 > 1$, $F_t$ decreases
with t until $p(t+1)\alpha_{t+1} < F_t$ or $\sigma_t \ge 1$. If the
former occurs, i.e., $p(t+1)\alpha_{t+1} < F_t$, then $F_{t+1} > F_t$
and the fraction increases until $\sigma_t \ge 1$. In
any case the desired value of t is the one which
gives $F_t$ its minimum positive value, or if there
is more than one such value of t, the smallest.
The maximizing process may be summed up as 
follows: 


(a) Permute indices so that $p(s)\alpha_s \ge p(s+1)\alpha_{s+1}$.

(b) Set b equal to the minimum positive value
of

$$\frac{1 - p_t}{1 - \sigma_t}, \quad \text{where } p_t = \sum_{s=1}^{t} p(s), \quad \sigma_t = \sum_{s=1}^{t} \frac{1}{\alpha_s}.$$

(c) Set $a(s) = p(s) - b/\alpha_s$ or zero, whichever
is larger. (The a(s) will sum to 1 - b.)


The desired maximum G will then be

$$G_{\max} = \sum_{s=1}^{t} p(s) \log p(s)\alpha_s + (1 - p_t) \log \frac{1 - p_t}{1 - \sigma_t},$$

where t is the smallest index which gives

$$\frac{1 - p_t}{1 - \sigma_t}$$

its minimum positive value.


It should be noted that if $p(s)\alpha_s < 1$ for
all s no bets are placed, but if the largest
$p(s)\alpha_s > 1$ some bets might be made for which
$p(s)\alpha_s < 1$, i.e., the expected gain is negative.
This violates the criterion of the classical
gambler who never bets on such an event.
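
Steps (a) through (c) of the maximizing process can be sketched in code for the no-channel case. The sketch assumes a genuine track take (the reciprocals of the odds sum to more than one) and uses a made-up three-outcome example; the function names and the numbers are illustrative assumptions. The example is chosen so that a bet is placed on an outcome whose expected gain is negative, as noted above.

```python
import math


def track_take_bets(p, alpha):
    """Return (b, bets): fraction held back and fraction bet on each outcome.

    Assumes sum(1/alpha_s) > 1, i.e. the track keeps a percentage.
    """
    # (a) Order outcomes by decreasing p(s) * alpha_s.
    order = sorted(range(len(p)), key=lambda s: p[s] * alpha[s], reverse=True)
    # (b) Find the minimum positive F_t = (1 - p_t) / (1 - sigma_t),
    #     starting from F_0 = 1 and stopping once sigma_t reaches 1.
    best_f, p_t, sigma_t = 1.0, 0.0, 0.0
    for s in order:
        p_t += p[s]
        sigma_t += 1.0 / alpha[s]
        if sigma_t >= 1.0:
            break
        f_t = (1.0 - p_t) / (1.0 - sigma_t)
        if 0.0 < f_t < best_f:      # strict "<" keeps the smallest such t
            best_f = f_t
    b = best_f
    # (c) Bet p(s) - b / alpha_s where that is positive, otherwise nothing.
    bets = [max(p[s] - b / alpha[s], 0.0) for s in range(len(p))]
    return b, bets


def growth_rate(p, alpha, b, bets):
    """G = sum p(s) log2(b + alpha_s * a(s)), expression (3) in base two."""
    return sum(ps * math.log2(b + a * x) for ps, a, x in zip(p, alpha, bets) if ps > 0.0)


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]          # hypothetical outcome probabilities
    alpha = [3.0, 3.0, 1.5]      # hypothetical odds; reciprocals sum to 4/3
    b, bets = track_take_bets(p, alpha)
    print("held back:", b)
    print("bets:", bets)
    print("G:", growth_rate(p, alpha, b, bets))
```

In this example the second outcome has p(s) times its odds equal to 0.9, below one, yet it still receives a bet, because covering it allows a larger stake on the first outcome.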


Conclusion 


The gambler introduced here follows an essen- 
tially different criterion from the classical 
gambler. At every bet he maximizes the expected 
value of the logarithm of his capital. The rea- 
son has nothing to do with the value function 
which he attached to his money, but merely with 


the fact that it is the logarithm which is addi- 
tive in repeated bets and to which the law of 
large numbers applies. Suppose the situation 
were different; for example, suppose the gambler's 
wife allowed him to bet one dollar each week but 
not to reinvest his winnings. He should then max- 
imize his expectation (expected value of capital) 
on each bet. He would bet all his available cap- 
ital (one dollar) on the event yielding the high- 
est expectation. With probability one he would 
get ahead of anyone dividing his money differently. 


It should be noted that we have only shown 
that our gambler's capital will surpass, with 
probability one, that of any gambler apportioning 
his money differently from ours but still in a
fixed way for each received symbol, independent of 
time or past events. Theorems remain to be proved 
showing in what sense, if any, our strategy is 
superior to others involving a(s/r) which are not 
constant. 


Although the model adopted here is drawn from 
the real-life situation of gambling it is possible 
that it could apply to certain other economic
situations. The essential requirements for the
validity of the theory are the possibility of
reinvestment of profits and the ability to control or vary
the amount of money invested or bet in different 
categories. The "channel" of the theory might 
correspond to a real communication channel or 
simply to the totality of inside information avail- 
able to the investor. 


Let us summarize briefly the results of this 
paper. If a gambler places bets on the input 




symbol to a communication channel and bets his
money in the same proportion each time a particular
symbol is received, his capital will grow (or
shrink) exponentially. If the odds are consistent 
with the probabilities of occurrence of the trans- 
mitted symbols (i.e., equal to their reciprocals), 
the maximum value of this exponential rate of 
growth will be equal to the rate of transmission 
of information. If the odds are not fair, i.e., 
not consistent with the transmitted symbol proba- 
bilities but consistent with some other set of 
probabilities, the maximum exponential rate of 
growth will be larger than it would have been with 
no channel by an amount equal to the rate of trans- 
mission of information. In case there is a “track 
take" similar results are obtained, but the formu- 
lae involved are more complex and have less direct 
information theoretic interpretations. 


Acknowledgements 


I am indebted to R. E. Graham and C. E. 
Shannon for their assistance in the preparation of 
this paper. 


References


1. C. E. Shannon, A Mathematical Theory of
Communication, B.S.T.J., 27, pp. 379-423, 623-656,
Oct., 1948.


2. Von Neumann and Morgenstern, Theory of Games
and Economic Behavior, Princeton Univ. Press,
2nd Edition, 1947.


AN OUTLINE OF A PURELY PHENOMENOLOGICAL THEORY
OF STATISTICAL THERMODYNAMICS:
I. CANONICAL ENSEMBLES


by


Benoit MANDELBROT
Faculté des Sciences de l'Université de Genéve 
Genéve, Suisse 


Summary. "Boltzmann's problem" of statistical 
thermodynamics, is that of eliminating the para- 
doxical incompatibility of structure, existing 
between the irreversibility of the classical phen- 
omenological thermodynamics, and the revers- 
ibility of any purely kinetic model, one could ever 
think of for these phenomena. One finds that, in
order to construct kinetic "analogs" to the laws
of phenomenological thermodynamics, the dynamics
of large assemblies of molecules (Liouville
theorem, etc. ...) must be completed by some 
hypotheses of randomness. Once established, 
this randomness can be followed up in its 
development, with no new conceptual paradox 
(although with great technical difficulty ; there 

is a great amount of current work on this topic). 
But the introduction of randomness still raises 
entirely uncleared problems. Since, therefore, 
the kinetic foundations of thermodynamics are 


not sufficient in the absence of further hypotheses 


of randomness, are they still quite necessary in 
the presence of such hypotheses? Or else, could 
not one "short-circuit" the atoms, by centering


upon any elements of randomness, for example those 


introduced by the process of observation? Our 
aim is to show (partly after Szilard) that a sub- 
stantial part of the results, usually obtained 
through kinetic arguments, could be obtained by 
postulating from the outset a statistical distri- 
bution for the properties of a system, and follow- 
ing up with a purely phenomenological argument. 
The spirit of the theory is extremely close to 
that of the conventional (Copenhagen) approach 

to quantum theory, and the results are quite 
parallel, although the mathematics is quite 
different. Randomness is introduced by following 
the modern statistical theory of the estimation of 
non directly observable intensive variables of 
state, such as the temperature. The discussion
of the methodological foundations of modern
statistics can thus be translated into a full-fledged,
and possibly significant, counterpart of the
discussion of the kinetic foundations of
thermodynamics. Statistics is thus provided with a
particularly concrete example for some of its
more involved methods; thermodynamics appears
clarified in its classical aspects, and is further
completed with an apparently new uncertainty
relationship. It may also be of interest to the
communication engineer to have a unified treatment
of the foundations of fluctuation phenomena,
and of methods of fighting noise: a discussion of
entropy and information, performed in this
spirit, will be given in Part II of the paper.


1. Introduction (+)


1.1. The nature of the problem.


The aim of this paper differs in one essential
respect from that of most investigations on
statistical problems, in communication and elsewhere.
This difference should be stated at the
outset, in order to avoid certain misunderstandings.
Most authors in statistics are concerned
with broadly engineering problems, of the
improvement of the design of coding and detection
procedures, and more generally, of methods
of testing hypotheses, and parameter estimation.
These problems assume that a sufficiently
complete understanding and description of the
necessary laws of nature has already been
acquired elsewhere. In some cases, the needs
of communication have led to an improvement
of the laws of physics. Our problem is precisely
the opposite: we wish to improve the knowledge
and presentation of precisely those laws of
physics, which enter in the above engineering
problem of communication, by centering upon the
role they play in these problems, and by using
the concepts and the mathematics derived for
their purposes.


Any degree of success, we could achieve, 
would be new proof of the fact, which is of course 
quite familiar, that well-chosen engineering 
problems often bring out the essentials of a phy- 
sical situation, in a way that is useful in a far 
wider context. Such used to be the role of heat 
engines in thermodynamics. Later on, one
tried to use the problems of coding for the same 
purpose : the starting point of those attempts 
was the misleadingly simple-looking problem, 
raised by the fact that the definition, by Shannon® 
of the information involved in communication, is 
mathematically identical to a classical definition


(+) In order to emphasize the essential similarity 
of approach, this § was made a paraphrase of the 
introductory lines of our!, on the statistical 


structure of natural languages ; see also %)3,4,9 


of entropy by Boltzmann. The study of this pro- 
blem was brought together with that of Maxwell's 
Demon (we shall see that it is no accident, since 
these are the only two problems of so-called
statistical thermodynamics, which involve an 
observer's decision, that is, are actually statis- 
tical in J. Neyman's terminology”). However, 

none of the now numerous attempts, to clarify 

the relationship of information to entropy, is 
generally felt to have brought much to either 
communication or thermodynamics. This may be 
due to the fact that, although information is as 
much as the coder, or the Demon, need and want 

to know about certain processes of measurement, 
(in order to establish their global balance sheet), 
this much is very little. It is already too little 

for the communication problem of detection; all 
the more so, it would be too little to even hope to 
base upon it a better conceptual understanding of 
the role of approximate observation (that is : of 
any observer) in thermodynamics. (Besides, the 
probability p introduced in Boltzmann's formula 
S = k log p, is a very queer kind of probability 
anyway; there is a difficulty there, without counter- 
part in communication, where one always has a set 
of prior probabilities for the possible signals, 

that is, is in the "Bayes" case of statistics). If

so, there could be no useful solution to the problem 
of entropy and information, considered alone. 
However, the more general problem, of the role of
observation and of the observer in thermodynamics, 
could now be studied in detail, with the help of the 
statistical theory of estimation (or detection), which 
has now become a very general methodological 
model of experience, as inductive behaviour of the 
physicist in the face of the unknown. ? 


It will be attempted to build a theory of
thermodynamics, which will be statistical as well as
phenomenological, around the problem of the
statistical estimation of state variables; the
problem of entropy and information will constitute
an application of the theory. (A previous attempt
by the author in chap. 4 of ° is now quite obsolete,
but the philosophy of models given in other chap.
of that reference stands fortified by the present
work.) Following Kramers, statistical
thermodynamics will be referred to as "thermostatistics".
Szilard's 8 previous approach is of the greatest 
relevance for the problem. 


One possible conceptual difficulty of this 
approach will be due to the exclusive use of 
classical probability concepts, in making clear 
exactly which phenomena and variables are to be 
considered as random. When first introduced into 
statistics by J. Neyman, classical probability 
encountered "deep rooted habits of thought". 
Besides, the approach will be very theoretical; 




but Whitehead may have been right again by
asserting that "often in our most theoretical
moods we may be closest to the most practical
applications". On the contrary, the mathematics
will be quite simple, although not usual in
information theory studies. This is because 
we can choose the engineering problems we 
"invert'', to be as simple as possible, which is 
of course not the case in engineering problems 
imposed by practical needs. 


1.2. The nature of the result 

Let us elaborate upon this problem. One aim
of communication theory is to find ways and means
satisfying certain criteria of quality, by which a
signal could be detected through a background of
noise. It has been recognized for some time 9, 10, 11
that this problem is, in principle, simply one of
estimating "at best" the emitted signal, $S_e$,
knowing the received signal $S_r$. For that, $S_r$ is
considered to be an observation from a random
(because perturbed by noise) population of
signals, and $S_e$ is the parameter of this
population.


The distribution $p(S_r/S_e)$ is considered
to be a part of the engineer's scientific
knowledge of nature, that is: at worst, it must be
determined by special observation; at best (if
noise is the least compatible with the "structure
of matter") it is given by physical laws of more
general validity: those of thermodynamics and
of quantum theory. We shall show that,
conversely, the conventional laws of
thermostatistics, from the foundations up, and a few new
laws, can be obtained by characterizing thermal
noise as being the "least disturbing for the
physicist"; a certain amount of imprecision in
what is the "least disturbing" will be shown to be
allowed by a corresponding familiar imprecision
of thermodynamics. The main criterion
involved, that of "sufficiency of certain valuations
of observation", is not even a variational
criterion; it postulates the impossibility of a
certain inference, and is an authentic
counterpart of the exclusion of certain heat engines by
the Carnot principle. Anyway, whichever
arbitrary anthropo-centered character may remain
in the theory, will be a counterpart of the
arbitrary assumptions about molecules, required in
the kinetic models, which aim to explain why
noise is the least disturbing for the observer
which is considered.


Further methodological discussion of the 
approach will be made in § 4, when the systems 
studied are defined (§2), and the Maxwell-Boltzmann
distribution is derived in several ways
(§3). Final discussion will be included in §6. 


2. States of a statistical physical system 


2.1 Restriction to one-parameter systems 


In any truly stochastic physical theory, the 
relationships between the various state variables 
of a system are assumed, from the outset, to be 
ruled by probability laws. The degree of com- 
plication of a theory is then determined by the 
most complicated family of probability distri- 
butions considered in that theory. We shall 
start by a restricted problem, in which the only 
families to appear will depend upon a single 
real parameter. The case of several parameters, 
and the non-parametric case will be considered 
later; the generality of the approach will appear 
only then, and it will be seen that one can obtain 
directly a theory taking account of exchanges of 
matter and of quantum effects. 


2.2 Definitions 

Consider a set of methods of measurement, 
defining a certain level of refinement of the 
physical analysis. Physical variables can then 


be of several kinds, distinguished as being, on 
one side, extensive or intensive; on the other 


side, as being observable or estimable. 


Observable variables are those which can be
considered as random variables (r.v.) before
measurement (and expressed by capital letters,
such as U); and can be actually and directly
measured, by single real (scalar continuous)
numbers (expressed by lower case letters, such
as u). The precision of results of measurement
will always be taken as infinite, which means that,
following the quantum theory, we shall consider
that, by measuring an observable, one actually
and physically puts the system in a "state"
described by the value which has been found. The
measurement may of necessity be infinitely
slow. Of course, even then, the results of
measurement are usually given only within a
margin of possible error, relative to a "truer"
value; the latter would be a r.v., depending upon
the system itself, whereas the error would be a
r.v. depending upon the process of observation.
But in fact, there is no criterion for sharing the 
contributions of the two sources of randomness: 
at best, the ''true'' value could be considered as 
an estimable. 


Consider now a system before measurement.
Its "state" is a probability distribution function
(p.d.f.); it can be considered as a "mixture" or
"superposition" of "states" after measurement.
If necessary, the state will be called "mixed",
as opposed to "pure" states after measurement.
This is a relative concept, since a state which is
pure with respect to one observable may also be
"mixed" with respect to another. A pure and a
mixed state are complementary, incompatible
descriptions of a system. Measurement thus
involves an unpredictable sudden jump from state
to state, or rather from being partly in several
states into being in one: there is nothing shocking
in this from the viewpoint of detection theory.


It will be assumed that the only way to realize
a "microcanonical set" of systems, all having
equal values for some observable, will be to
pick them one by one, after measurement. Thus
no "intensive" observables will be considered.


Estimable variables are those parameters
which express the dependence of the probability
distributions of the observables upon the
properties of the physical system studied. One
must therefore consider populations of systems
(the "collectives", in the older terminology), all
having the same value for some estimable. When
such an equality can be considered as realised by
a physical interaction, the estimable is called
"intensive". However, much of the theory holds
for any estimables.


The obvious example of an observable is the
energy U; the corresponding estimable variable
is the inverse temperature B (the "observable"
character of the energy is a universal axiom of
physics, although sometimes hard to justify). A
measure of energy is, for example, obtained
with a so-called "thermometer", by observing
the change in its geometrical shape when it
absorbs U. On the other side, the uniformity of
the B's of a set of systems can be ascertained
by leaving them a sufficiently long time in thermal
contact. (It is not useful to say that fixing B
is a kind of measure of it!) Let the probability
distribution, and eventually the density, of U be
given by P(u/B) and p(u/B):


$$P(u/B) = \Pr[U \le u / B] ; \qquad p(u/B) = dP(u/B)/du$$


Both kinds of variables have been defined
only up to a one-to-one transformation. Sometimes
there exists a particularly intrinsic
determination of an observable, which is additive,
i.e. such that the u of any union of disjoint sets
is the sum of the separate u_i. The p(u/B) is then
given, knowing the p_i(u_i/B), by iteration of the
"convolution":


$$p(u/B) = \int_0^u p_1(u_1/B)\, p_2(u - u_1/B)\, du_1$$


B is still an estimable variable of the sum; it is
also an estimable variable of any subdivision of
the system, if such a subdivision is possible.


2.3 The estimation of estimable variables®, 


Consider now a system of known energy u.
Let the prior density p(u/B) be positive for all
positive u, as is usual. Since the energy is
known, the very definition of estimable variables,
as parameters, fails. In fact, any single system
can be considered as being in "thermal equilibrium"
with any thermostat (i.e. an infinite set of
other systems in thermal equilibrium with each
other), since the addition of a single system,
having a possible value for u, does not perturb
the probability distribution of an infinite set.


From the viewpoint of prediction of future events
happening to the system, the population from
which it is drawn should have, intuitively, little
importance (we shall in fact assert later, after
Szilard, that it has strictly no influence, and this
will be shown possible only in the case of the
Maxwell-Boltzmann prior distribution). Strictly,
to know B is therefore a problem of "retrodiction".
But one may also wish to predict a
prior distribution for the energy of other systems
of the set from which the first was drawn,
and, for that, some reasonable guessing of the
initial B would be useful. This B, although unknown,
is not a random variable. Except under
very rare conditions, where one has a prior
distribution for it (the "Bayes case"), there is
no limit-of-frequency sense to be attributed to
the feeling that "some values of B are more
'probable' than others". However, there is a
very clear meaning to statements of the probability
of making an error by guessing a certain value
for B. One can then try to construct estimators,
or estimating intervals, so as to minimize
certain chosen probabilities of error. Mathematical
statistics is a technique for doing so,
but it cannot ever justify any chosen criterion
of good guessing.


Any estimator is a single-valued function of
u; therefore, by applying it to parts of a whole
of uniform temperature, one obtains a classical
probability distribution of estimated "local
temperatures"; the parameter of this distribution
is the true B: therefore estimates of B are not
intensive variables. Many estimators are
consistent, i.e. the distribution of the local
temperatures gets more and more concentrated
around the true value when the size of the samples
used increases. This makes it possible to
measure the temperature of an infinite set with
any precision. But for any sample size, the
estimation of B is an essentially irreversible
procedure; this fact was emphasized in the
information theory literature by P.M. Woodward12
(only in the Bayes case); it will be taken up in
Part II in great detail.


Under these conditions, the replacement
of the knowledge of u, and of p(u/B), by a single
estimator of B, or by upper and lower bounds for
B, is an operation entirely different from the
measurement of the random variable U. In the Bayes
case, however, it is very close to the operation of
prevision of the most probable or average future
evolution of a system ruled by probability laws.
The predicted evolution is of course different from
the actual one. The point has led to great
discussion in quantum theory, see von Neumann13
(§ V.1). But a simpler case of the difference between
spontaneous random evolution and noisy
estimation is given in information theory by a
comparison of Shannon's definitions of information
of a Markovian message and of a noise-perturbed
one. Both are specified by stochastic matrices
(matrices such that row sums are one). But the
ideas are entirely different, and the main point of
reversibility and irreversibility in statistics is
more basic than the hypotheses that matrices are
stochastic (or unitary in quantum theory).


Fiducial distributions. As a measure of "degree
of confidence" in various values of B, R.A. Fisher
(1934) has introduced a certain "conceivable"
distribution of possible values of an actually non-random
variable, a distribution based upon p(u/B).
This concept (defined only in the case of
sufficiency, see § 3.1) is often considered as
something to be carefully avoided, even if it is not
claimed to be a kind of probability distribution. It seems
however to be implicit in some thermodynamical
arguments, where it is not necessary, as we
shall see.


In view of this irreversibility of estimation, we
shall delay the consideration of specific methods
to § 5, and first consider the case where there
exist intermediate steps of the estimation which
are reversible with respect to certain questions,
that is, such that if one takes them, one loses
"no information", in appropriate senses of
information (more general than Shannon's). One
definition of reversibility will fully determine the
Maxwell-Boltzmann distribution, that is,
characterize thermodynamics. But in any case,
if there exist possible reversible steps, one
should start with them.


3. The reversible part of the estimation of B.
Derivations of the Maxwell-Boltzmann distribution


3.1 First derivation, based upon the concept of 
sufficiency. 


Criterion of sufficiency. Suppose that the extensive
observable u can be chosen to be additive:
this is indeed a very strong hypothesis, since it
excludes any interactions between neighbouring
systems, and a fortiori quantum interactions
between distant systems. Let us take several
sample systems from a population believed to be of
uniform B. Let u_i be their energies, and
u = Σ u_i. One could estimate B, through B_e,
either starting from u or from all the u_i.
Independence of the result with respect to shape
or disposition of the systems implies that B_e
should be a symmetric function of the u_i. If
moreover any p_i(u_i/B) can be considered as the
distribution of a sum of still smaller variables
(the possibility of doing so will be investigated in
Part III), one can compute B_e from finer and
finer subdivisions of the system. But B is considered
as a "macroscopic" intensive variable,
which roughly implies that, from a certain level
down, no improvement of the estimation can be
obtained by measuring finer subdivisions, although
such subdivisions are quite possible.




Postulate this independence of the estimate,
relative to the knowledge of individual u_i, to be
strict, and no longer asymptotic. This means that,
for example:


(S) Pr(u_1/u, B) = Pr(u_1/u)


independently of B. Nothing could then be drawn
from the observation of the sample distribution
of the energy among the different parts: no
Maxwell's Demon could beat the macroscopic
observer in the estimation of B by measuring
the partition of energy among molecules. In
R.A. Fisher's14 statistical terminology, u is then
said to be a sufficient statistic for the estimation
of the original B. If sufficiency is postulated
at one level of subdivision, and additivity at all
levels, sufficiency holds at all levels.


The intuitive concept of macroscopic was
apparently weaker: it meant only asymptotic
sufficiency, with estimation depending rapidly less
and less on finer subdivision; it could then
depend on the first subdivisions. But on the
other side, putting together several systems, and
letting them interchange energy, is considered to
be a reversible operation at any level. In fact,
however, any axiomatic approach to thermodynamics
is bound to be scale-less (see the
axiomatics of Carathéodory, which is valid even
for a single molecule15; see also the principle of
Casimir, that the phenomenological laws of
irreversibility apply even to fluctuations). It will
turn out that strict sufficiency is implicit in the
usual thermodynamics.


To sum up, the principle of sufficiency can be
considered as established by all the experiments
leading to the belief in the possibility of macroscopic
description. It is thus established only
below a certain level. On the contrary, the principle
of additivity is established only above a certain
level. We postulate that both hold exactly
in a certain strip of sizes.


Maxwell-Boltzmann (M.B.) distribution. Under
certain regularity conditions, the only distribution
for which u is a sufficient statistic for B is the
M.B. probability density:


$$p(u/B) = G^{-1}(B)\, S(u)\, \exp(-Bu)$$


The detailed statistics can thus be derived from
a single overall principle, a qualitative one, in
the sense that it answers a yes-or-no alternative:
this cannot fail to recall Carnot's principle or the
axiomatics of relativity theory. Strictly speaking,
one should write some function of B instead of B,
but the scale of B can be assumed fixed in consequence
(the scale of u is fixed by additivity).
One can even prove from sufficiency that the best
estimate of any function of B is that function of the
best estimate of B; in the case of sufficiency,
estimation and functional transformation
commute.
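

As a numerical illustration of criterion (S), the following Python sketch assumes a hypothetical structure function S(u) = u^(k-1)/Γ(k), which is not taken from the text and for which G(B) = B^(-k). Under that assumption it checks that the conditional density of u_1, given the total u, takes the same value for several B. The names mb_density and conditional_density are illustrative only.

```python
import math

def mb_density(u, B, k):
    """M.B. density G^{-1}(B) S(u) exp(-B u), with the ASSUMED structure
    function S(u) = u**(k-1) / Gamma(k), for which G(B) = B**(-k)."""
    return B**k * u**(k - 1) * math.exp(-B * u) / math.gamma(k)

def conditional_density(u1, u, B, k1, k2):
    """Density of u1 given the total u = u1 + u2 and the parameter B."""
    joint = mb_density(u1, B, k1) * mb_density(u - u1, B, k2)
    # For this structure function the sum of the two systems is again M.B.,
    # with exponent k1 + k2 (convolution of the two structure functions).
    total = mb_density(u, B, k1 + k2)
    return joint / total

# Hypothetical sizes and energies, chosen only for the illustration.
k1, k2, u, u1 = 1.5, 2.5, 3.0, 1.2
values = [conditional_density(u1, u, B, k1, k2) for B in (0.5, 1.0, 4.0)]
print(values)  # the three entries agree: B has dropped out
```

The B-dependent factors cancel between numerator and denominator, which is the separability of u and B that sufficiency requires.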




The above result was proved independently in
1936 by G. Darmois (see appendix 1), B. O.
Koopman17 and E. J. G. Pitman18; before them, it
was implicitly proved by L.Szilard® in 1925. 
Under slightly weaker conditions of regularity, 
the distribution need not have a density, and 
may be of the form:


$$dP(u/B) = G^{-1}(B)\, dV(u)\, \exp(-Bu)$$


S(u) and V(u) are the "structure function" and
the "integral structure function" of the system.
The "structure generating function" G is, because
of the requirement that $\int p\, du = 1$, the Laplace
transform


$$G(B) = \int S(u) \exp(-Bu)\, du = \int \exp(-Bu)\, dV(u)$$


We shall also need the function J(B) = log G(B),
called the "Planck potential". The usual probabilistic
characteristic function of the M.B. distribution is


$$\varphi(t) = \int \exp(itu)\, p(u/B)\, du = G(B - it)/G(B)$$
(The cumulant generating function is $\log \varphi(t)$.)


Essentially, sufficiency, through the M.B.
distribution, requires the separability of the
variables u and B appearing in the two-variable
function p(u/B), in the sense in which one
separates variables in differential equations.


The general result, that the expected value
(E) and the variance (D) of U, knowing B, are given by the
first two derivatives of $\log \varphi(t)$ for t = 0,
now becomes


$$EU = E(U/B) = \int u\, p(u/B)\, du = -\frac{d \log G(B)}{dB} = -\frac{dJ}{dB}$$


$$DU = D(U/B) = \int (u - EU)^2\, p(u/B)\, du = \frac{d^2 \log G(B)}{dB^2} = \frac{d^2 J}{dB^2}$$
Note that G(B) may be defined only for
$B \ge B_0 > 0$, and be a divergent integral for
$B < B_0$. It is impossible, in the ordinary thermodynamics
of matter, that $B_0 > 0$; but,
clearly, the present theory is much more general
than is required by matter, and there are exceptional
applications where a positive abscissa of
convergence of the Laplace transform is the
main feature; see 1,2,3,4,5.
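

The two moment formulas can be checked numerically. The Python sketch below again assumes the hypothetical structure function S(u) = u^(K-1)/Γ(K), so that J(B) = -K log B; this choice is not taken from the text. It compares the mean and variance obtained by direct integration of p(u/B) with those obtained from finite-difference derivatives of J. The function names, the grid sizes and the step h are illustrative assumptions.

```python
import math

K = 3.0  # hypothetical exponent of the ASSUMED structure function

def S(u):
    """Assumed structure function S(u) = u**(K-1) / Gamma(K)."""
    return u**(K - 1) / math.gamma(K)

def J(B):
    """Planck potential J(B) = log G(B); for this S, G(B) = B**(-K)."""
    return -K * math.log(B)

def moments_by_integration(B, u_max=40.0, n=200_000):
    """Mean and variance of U under p(u/B) = S(u) exp(-B u) / G(B),
    computed with a midpoint rule on [0, u_max]."""
    du = u_max / n
    G = math.exp(J(B))
    m1 = m2 = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        w = S(u) * math.exp(-B * u) / G * du
        m1 += u * w
        m2 += u * u * w
    return m1, m2 - m1 * m1

def moments_from_J(B, h=1e-3):
    """EU = -dJ/dB and DU = d^2 J / dB^2, by central finite differences."""
    mean = -(J(B + h) - J(B - h)) / (2 * h)
    var = (J(B + h) - 2 * J(B) + J(B - h)) / h**2
    return mean, var

m_int = moments_by_integration(2.0)
m_J = moments_from_J(2.0)
print(m_int, m_J)  # both pairs should be close to (K/B, K/B**2)
```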


When J(B)/B is identified to the "free energy" 
of a system’”, it is seen that the free energy 
cannot be an arbitrary function of B: its exponential
must have a positive inverse Laplace transform
(it is also called "completely monotone"; the
signs of successive derivatives alternate). This
fact has, surprisingly, never been mentioned, to
our knowledge, in papers on statistical thermodynamics:
this is clearly because for very large
systems it is irrelevant, cf. § 6, so that there
is no need to justify it in large-scale thermodynamics.


Distribution of a sum of M.B. systems. Consider
the sum of two systems following M.B. distributions
with respectively structure and generating functions
S_1, S_2, G_1, G_2 and the same B. The distribution
of the energy of the sum of these systems will be


$$p(u/B) = \int S_1(u_1)\, G_1^{-1}(B) \exp(-Bu_1)\, S_2(u - u_1)\, G_2^{-1}(B) \exp(-B(u - u_1))\, du_1$$

$$= \left[ \int S_1(u_1)\, S_2(u - u_1)\, du_1 \right] G_1^{-1}(B)\, G_2^{-1}(B) \exp(-Bu)$$


Thus, the distribution of the sum of M.B. systems
of the same B is still M.B.: in the addition, structure
functions transform by convolution, generating
functions simply multiply, and the J(B) functions
simply add. This latter function may be considered
as a particularly intrinsic measure of
"contents". The same results clearly hold
for finite sums of systems, and also for denumerable
sums.
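

The composition rule for a sum of systems can be illustrated numerically. The following Python sketch takes two hypothetical structure functions, S_1(u) = 1 and S_2(u) = u (chosen for convenience, not taken from the text), forms their convolution on a grid, and compares the Laplace transform of the convolution with the product of the separate generating functions. The helper names laplace and convolve, and all grid sizes, are assumptions of the sketch.

```python
import math

def laplace(S, B, u_max=40.0, n=4000):
    """Midpoint-rule approximation of G(B) = integral of S(u) exp(-B u) du."""
    du = u_max / n
    return sum(S((i + 0.5) * du) * math.exp(-B * (i + 0.5) * du)
               for i in range(n)) * du

def convolve(S1, S2, u, n=100):
    """Midpoint-rule approximation of the integral over [0, u] of
    S1(u1) S2(u - u1) du1."""
    du1 = u / n
    return sum(S1((j + 0.5) * du1) * S2(u - (j + 0.5) * du1)
               for j in range(n)) * du1

# Two ASSUMED structure functions (illustration only).
def S1(u):
    return 1.0

def S2(u):
    return u

def S12(u):
    """Structure function of the sum of the two systems."""
    return convolve(S1, S2, u)

B = 1.5
G1, G2, G12 = laplace(S1, B), laplace(S2, B), laplace(S12, B)
J1, J2, J12 = math.log(G1), math.log(G2), math.log(G12)
print(G12, G1 * G2)   # generating functions multiply
print(J12, J1 + J2)   # Planck potentials add
```

With these choices the exact values are G_1 = 1/B, G_2 = 1/B^2 and G_12 = 1/B^3, so the agreement is limited only by the quadrature error.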


3.2. Szilard's derivation of the M.B. distribution


It turns out that the above derivation of the 
M.B. distribution could have been an inverted 
paraphrase of a derivation due to Szilard®.However, 
Szilard did not go much beyond this derivation, and 
his presentation could not readily profit from 
subsequent developments of mathematical 
statistics. Szilard's paper is often quoted, see
von Neumann13, but seldom analysed. Similar
considerations can be found in a paper by G.N. 
Lewis Ps quoted by Fowler and Guggenheim2 ° 
Both authors wished to show how fluctuation
phenomena can be introduced into classical
thermodynamics without destroying its structure
and spirit. "The second principle loses nothing in
rigour because of fluctuations, and in no way
becomes an approximate principle: it melts into
a higher harmony containing the laws of fluctuation."


Szilard considers systems in random evolution,
i.e. such that their properties at one time determine
only probabilities at a later time; in particular,
energy exchanges between systems in contact are
only random. He assumes that the probability
distribution of a system at an instant of time
depends on one parameter only, the temperature;
that systems which have long been in "thermal"
contact and have exchanged energy have equal
temperatures and may be considered to be in
thermal equilibrium; finally, that systems in
thermal equilibrium at one time remain in
thermal equilibrium. The existence of a single
number T, characterizing thermal equilibrium, is
the "Zeroth Principle" of thermodynamics, following
Fowler (see 21). (In our more purely probabilistic
model, we did not necessarily need to introduce
energy exchanges.) (The fact that energy does not
vary in the interactions is the First Principle.)
Temperature is not however assumed to determine
a single energy, but only some superposition
of states of different energy (these
states have an objective existence, independently
of any observer); this distribution is realized
among systems having long been in thermal
contact with a "thermostat".


The infinity of possible energy distributions
will now be reduced, by translating the fact
that p(u/T) cannot be modified without a
compensation, for which a new thermal contact
is necessary. This will be translated into
postulates, about the more detailed nature of
thermal equilibrium, which are a priori only
"reasonable" but a posteriori experimentally
correct, in the sense that they lead to the M.B.
distribution.


Let two systems, of energies u_1 and u_2,
initially in contact with a thermostat, be separated
and brought in very long thermal contact
with each other. Their initial energies
U_1 and U_2 were independent random variables,
of same parameter T. Szilard assumes that
the nature of thermal equilibrium is such that
the energies U'_1 and U'_2, after long contact, are
again independent random variables, having
moreover distributions independent of the initial
temperature and of the initial values u_1 and u_2, and
conditioned only by the constant total energy u.


Similarly, Lewis considers that, when a
quantity is shared between two systems, "the
ratio of specific probabilities of any couple of
partitions depends only upon the nature of the
two systems, and in no way upon their method of
connection, or upon the existence, nature or mode
of connection of other systems". He then postulates
that "one should expect that the probabilities
of various partitions of a quantity U between
two systems depend only upon the total u,
and in no way upon the reservoir with which the
two parts are in connection. Thus, one would
expect the same partition, whether the two
systems are in very imperfect contact with
one reservoir, or whether they are in contact
with another reservoir, of very much higher
temperature, but have the same total energy
through a very rare fluctuation." (This is not an
exact quotation, but the English translation of
the author's French translation of Lewis's
words.)


It is seen, with great pleasure, that both
authors have rediscovered the principle of
sufficiency, exactly in the form (S) of § 3.1,
but have interpreted it as a property of equilibrium,
not of observation. As already mentioned,
Szilard has even anticipated the derivation of the
M.B. formula, and Lewis has considered ratios
of probabilities of couples of partitions, which
is now a constant procedure in likelihood ratio
tests.


It is remarkable that one can use such a
"yes or no", and not extremal, definition of
equilibrium, and go so far.


To prove the M.B. p.d.f., Szilard writes that
$p_1(u_1)\, p_2(u_2)/p(u_1 + u_2)$ should be independent of
T. Compare this with some "derivations" of
Boltzmann's other formula for entropy S:
"S = log p". They amount to assuming that
$p_1(S_1)\, p_2(S_2)/p(S_1 + S_2)$ is not only independent of T,
but strictly equal tol. This has the dis- 


Bivautaue Guanpsosl aati ec probability, and’ of 
the immediate introduction of a new concept,
identified only later. Besides, to introduce the
needed S(u) term, one must postulate immediately
that one energy may correspond to several
distinct (degenerate) states
(see § 5.1). Szilard's approach is clearly preferable.


Through a discussion of fluctuation sizes
for different systems, Szilard shows that T is
a universal temperature; however his discussion
of fluctuation is less complete than it is now
possible to make, and generally, his approach,
although closer to usual thermodynamical
thinking, is less easy to generalize (chiefly
because of the restriction on intensive variables).


Principle of Onsager-Casimir. In the theory
of the irreversible decay of deviations from
equilibrium22 one has to assume that the same
laws apply to "small but macroscopic" deviations
and to fluctuations. This amounts to postulating
that the way a deviation was reached is irrelevant,
and only its size matters. This is a markovian
hypothesis23 (which is a probabilistic form
of Huygens's principle): statistical independence
of past and future, when the present is known. It
leads to Onsager's relations of reciprocity. In
the necessarily weaker theory of equilibrium with which we
are concerned, one makes the weaker hypothesis
that independence from the past is attained
after a long time only. Thus, Onsager's theory
is a quite proper "interpolation" of the present
theory, with a full markovian hypothesis. It is
curious to note that a weakened probabilistic
Huygens's principle leads to the same results as
Carnot's principle.


3.3. Second derivation, based upon the concept of
efficiency.


Criterion of efficiency. It was mentioned in § 3.1
that, since the estimate of B is a non-random
function of u, it is a random variable, when referred
to the ensemble of constant B from which the
system is drawn. Assume that the distribution of
the u is known, but not necessarily M.B., and
compare different possible methods of estimation,
all assumed to be unbiased, that is, such that the
mean of the estimator is equal to the true value.
For that, compare their variances, i.e. mean
square deviations from the mean. (This is a quite
different concept from the usual temperature
fluctuation.)




It can be shown (see appendix 2), by an application
of Schwarz's inequality, that, whatever
the distribution p(u/B), the variance of an
estimate of a function f(B) is necessarily
bounded below by the expression:


$$\frac{(df/dB)^2}{F}, \qquad F = E\left[\left(\frac{d \log p(u/B)}{dB}\right)^2\right] = -E\left[\frac{d^2 \log p(u/B)}{dB^2}\right]$$


F is called "Fisher's information". This
limit means not only that one does not know
how to perform a more precise estimation,
but that, under certain conditions of
regularity, one could not conceive of any such
estimation (which, even though it could not be
justified by its properties, should not be dismissed
for empirical reasons). (Remark that
Hodges and Le Cam have shown examples of
estimators more precise than the above limit,
but only for a set of values of B which is of
Lebesgue measure 0.) The existence of upper
limits to the precision of measurements considered
as indirect, which are in fact estimations,
is quite familiar in quantum theory.


Maxwell-Boltzmann distribution. Usually,
there are still closer lower limits to variance,


but Fisher's limit can be exceptionally attain- 


ed cei ke 3) if 
E(e 10g p/u/B) ) 


dys, 


Under certain conditions of regularity, the
lower limit to variance can be attained only
with the M.B. distribution. Thus the overall
statistics can also be derived from a variational
principle.


Then, Fisher's information takes the especially
simple form:


$$F = \frac{d^2 \log G(B)}{dB^2} = \frac{d^2 J(B)}{dB^2} = DU$$


It is a function of the "energy content" J(B).


Uncertainty relationship. Let f(B) = B. Then
clearly


$$DU \cdot DB = 1$$


The less well-known is the energy of a
system, when its true temperature is known,
the "larger" the system (in terms of its contents
log G(B)), and the better the temperature
can be estimated from such a large sample.
And conversely.


Or else, disregard the true temperature.
Any estimation gives B together with DB;
identify the estimator of B with the true B, but
only for the purpose of estimating the DU of U
before the measurement. Then, the larger DU,
the smaller DB.


One cannot fail to note the formal identity of 
this relationship with Heisenberg's relationship, 
or its equivalent in communication: Gabor's 
relation "Df, Dt = 2" in the spectral analysis of 
signals. In fact, all three are one relation: they
result from equality cases of Schwarz's inequality,
for dual variables. It does not matter, of course,
whether the duality is Fourier (quantum theory
and spectral analysis) or Laplace (here). But
in fact there is a deep difference: in Heisenberg's
relationship, the observer can choose which
variable he wants to know with greater precision;
here, this is determined by the contents J(B) of
the system studied: B is known exactly only for
the infinite thermostat, and U exactly only when
it is certainly zero.


However, it is pleasing, writing DU.D(1/T) = k,
to find an absolute meaning attached to size, in a
way formally quite similar to the introduction of
the size of quanta by the relation DU.Dt = k.


Proper scale. It would be pleasing to have a new
scale of B, such that DB be independent of the
estimated B, that is, of the measured u. (This
seems to be what MacKay calls the "proper scale".)
Clearly:


$$f(B) = \int \left( \frac{d^2 J(B)}{dB^2} \right)^{1/2} dB$$


This proper scale was implicitly used by Girshick
and Savage24, in proving that u is an admissible
minimax estimate of the intensive variable EU,
relative to the risk function (u - EU)^2/DU.


"Noise", Take now f(B) = EU =-d log G(B)/AB 
One finds that D(EU) = DU. In other terms, the
estimation variance of the "true" value EU is
equal to the fluctuation of u around its mean
value. This is not obvious, but a theorem, which
should be brought together with the fact, mentioned
above, that when a sufficient statistic exists,
the estimate of a function of B is the function of
the estimate. Further, the fact that, even when u
is assumed strictly known, there is a sense to be
attributed to the fluctuation of something very
close to U, is very comforting, in view of the
paradox in considering the measure as infinitely
precise. But even this new noise has nothing to
do with "hidden variables".
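

The relations of this subsection can be checked on an example. The Python sketch below assumes, as an illustration only, the Planck potential J(B) = -K log B (that of the hypothetical structure function u^(K-1)/Γ(K)). It evaluates Fisher's information F = d²J/dB² by finite differences, then the lower limit (df/dB)²/F for f(B) = EU and for the proper scale f(B) = √K log B, which is the integral of (d²J/dB²)^(1/2) for this particular J. All names and step sizes are assumptions of the sketch.

```python
import math

K = 3.0  # hypothetical exponent; J below is an ASSUMED Planck potential

def J(B):
    """Assumed Planck potential J(B) = log G(B) = -K log B."""
    return -K * math.log(B)

def d1(f, x, h=1e-4):
    """First derivative by central differences."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-3):
    """Second derivative by central differences."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def fisher_information(B):
    """F = d^2 J / dB^2, which for an M.B. distribution equals DU."""
    return d2(J, B)

def variance_bound(f, B):
    """Lower limit (df/dB)^2 / F to the variance of an estimate of f(B)."""
    return d1(f, B) ** 2 / fisher_information(B)

def mean_energy(B):
    """f(B) = EU = -dJ/dB."""
    return -d1(J, B)

def proper_scale(B):
    """Integral of sqrt(d^2 J / dB^2) dB for this J: sqrt(K) * log B."""
    return math.sqrt(K) * math.log(B)

for B in (0.5, 2.0):
    DU = fisher_information(B)
    # Limit for EU equals DU; limit on the proper scale does not depend on B.
    print(B, variance_bound(mean_energy, B), DU, variance_bound(proper_scale, B))
```

Since u itself is an unbiased estimate of EU with variance DU, the first printed limit is attained, in agreement with D(EU) = DU.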


3.4 Other derivations of the MB distribution 


The concept of sufficiency has many other
aspects. Their discussion is however better
delayed until some new concepts have been introduced.
Let us mention however that sufficient
valuations of observations preserve "information"
in many different senses of the word: Fisher's,
and also Shannon's (and in a generalisation of
both these senses due to Schützenberger, see Part
II), and also in a sense due to Bohnenblust,
Shapley and Sherman (see § 5.2).




4. Discussion on the methodology of § 3:
Two complementary types of statistical thermodynamics.


Let us interrupt here the construction of the
theory, to comment upon what is being done.
(The reader may immediately proceed to § 5,
where a great amount of new arbitrariness is
introduced, in the actual irreversible estimation
of the B of the M.B. distribution, and to § 6,
where this arbitrariness is shown to vanish
asymptotically, for all practical purposes, in large
systems.)


We have succeeded (partly after Szilard) in 
deriving, from purely phenomenological criteria, 
some results of statistical thermodynamics, a 
science considered to be so deeply related to 
kinetic and such models, as to be also called 
statistical mechanics. The M.B. distribution is 
no longer characterized as being the one which 
would be steady under collision phenomena, but 
as having certain good properties under observation.
Several non-classical results were also
found. How does this fit into the classical scheme?


4.1 The two classical methods of thermodynamics. 


At least since Clausius, one recognizes two
methods of mathematical structuration for
thermodynamics: the phenomenological, also
called macroscopic, pure, classical, axiomatic,
etc., and the kinetic, also called microscopic,
statistical, etc. The latter contains more
results than the former, in particular the
statistics, and is also the older of the two (the
Greeks, Gassendi, the Bernoullis), and closer
to intuition. Despite this, it is considered as
conceptually subordinate to the other, since it
is required to derive the principles of the other
as theorems, whereas the converse is usually
not considered.


How well is this explanation achieved? From
the viewpoint of rigor, notoriously very poorly:
there is in fact a complete paradox a priori in
any kinetic explanation: it is the fundamental
incompatibility of structure between the irreversibility
of some principles of phenomenological
thermodynamics, and the mechanical
reversibility of any purely kinetic model one
could ever think of for these principles (e.g. the
paradoxes of Loschmidt and of Zermelo25).
These objections gradually forced Boltzmann to
add to the mechanical assumptions. Following
Uhlenbeck, let us call the problem of reconciling
the two viewpoints Boltzmann's problem26. The
ideal would have been to derive the macroscopic
results from microscopic assumptions; but in
fact one27 is more properly looking for analogies,
or logical structures analogous to thermodynamics,
by adding to the kinetic models some rather
arbitrary randomness assumptions, often introduced
through the necessary imprecision of
"coarse observation". Any initial unevenness of
probability distribution, although strictly preserved
because of Liouville's theorem, is supposed to
dissipate itself into thinner and thinner streamlines,
so that any coarse density becomes uniform
after an average time which increases when the
coarseness decreases. For a system of a given
age, there will be an extremely rapid, and
increasingly sudden, variation of properties when
the size of the cells goes to zero. For a very old
system, one tends towards the situation in which
there is a sharp discontinuity, and a "singularity"
of phenomena near infinitely sharp definition.
This recalls the singular phenomena observed
when the viscosity of a fluid tends to zero, which
also relate to irreversibility and dissipation.
Once established, this randomness can be followed
in its development with no new conceptual
paradox, although with great technical difficulty;
there is a great amount of current work on this
topic28. But the introduction of randomness still
raises entirely unresolved problems.


4.2 Statistical thermodynamics without hidden 
variables. 


Since, therefore, the kinetic foundations of
thermodynamics are not sufficient without a
further (single29) hypothesis of randomness,
are they still necessary in the presence of such 
a hypothesis? After all, although statistical 
thermodynamics owes its origin and development 
to the theory of atomism, it need not always be 
so. In fact, partly after Szilard”, we are in the 
process of showing that one can "short circuit" 
the atoms by centering upon any element of 
randomness, for example introduced through 
necessarily unprecise observation, and we are 
deriving a substantial part of the fundamental 
results, usually obtained through kinetic argu- 
ments, by following up the introduction of random- 
ness in a purely phenomenological way. 


The fact would have been of course more 
striking, before the times when atomic theory 
ceased to be a rather doubtful conjecture. It is 
also a pity that Szilard's original paper was not
more often quoted in the contemporary (around 1925)
discussions about the "causal interpretation" of
quantum theory through "hidden variables". In
fact, a rapid decline of interest in the phenomenological
and "energetist" approach to thermodynamics
was contemporary with the Copenhagen
approach to quantum theory (the greatest success
of the school of "never going beyond observation",
even in words: a still triumphant school in thermodynamics
in 1895, when Boltzmann was complaining
that "kinetic theory was, so to speak, out of
fashion in Germany"). There is however a renewed
interest now in causal reinterpretations. Take for
example von Neumann's13 proof of the impossibility
of introducing hidden variables into the conventional
(Copenhagen) approach to quantum theory
without first modifying it. This proof is part of
a reduction of a large part of the quantum rules
to a set of phenomenological and axiomatic
rules for observation, starting from a purely 
stochastic viewpoint very similar to that used 
here. The incompatibility between a quantum 
structure and hidden variables, may therefore 
be compared to the incompatibility between 
thermodynamics and kinetic theory (paradoxes
raised against Boltzmann). An attempt
to go around those paradoxes, such as the current
work of Bohm30, de Broglie31 and Vigier32, is
to be compared to the attempts to "solve"
Boltzmann's problem. This current work does
not actually attempt to build a theory of which
the quantum theory would be the "thermodynamics",
but simply to disprove the impossibility
of building such a theory, by introducing
randomness not fundamentally, but through a
chaos hypothesis. It may be that the possibility
of such a twin set of quantum theories would now
appear less shocking in view of the existence
of a statistical thermodynamics without hidden
variables, apparently just as "closed" as the
conventional quantum theory. However, even
if there is some pleasure and help in the great
similarity of the words used to describe the
two situations, there is no question of mathematical
identity: an opposition will remain
between the unitary and contact transformations,
and between a purely real and a complex theory.
Only, the single opposition of quantum vs. non-quantum
statistics will be completed by a second
opposition, of phenomenological vs. kinetic.
Each of the six possible comparisons of the
four theories is enlightening.


This methodological discussion will be
continued through §6 and in the conclusion to part I.


5. The irreversible and final part of the estimation of B.


5.1 First method of estimation; maximum likelihood B and Boltzmann's most likely state. Degeneracy.

To derive the M.B. distribution, in §3, we
did not need to specify the actual arbitrary
method of estimation to be used in order to attain the
optima shown to be possible. The function
linking B to u must now be specified. In
principle, rules of estimation should be
derived from the properties desired from the
estimate. In fact, however, the most usual
rules of statistics were introduced for no
conscious reasons, except simplicity, and if
motivated at all, were first motivated by considerations
of "degree of reasonable belief" quite
foreign to classical probability; and only later
justified by their properties?, The estimation 
theory of thermostatistics is only implicit, 
though quite real; it will turn out, curiously 
enough, that it has used exactly the same 
procedures as statistics.


Start with the most widespread theory of
estimation: R.A. Fisher's14 theory of maximum
likelihood estimation. To find a measure of
rational belief in a value of B, when we are reasoning
from the sample to the population, Fisher
inverts the function p(u/B), taking now u as a
parameter and B as a variable. Of course, p(u/B)
then ceases to be a probability distribution, and
B is no random variable. For example, $\int p\,dB \neq 1$
(and the integral even depends upon the arbitrary scale of
B); therefore one cannot speak of the likelihood
of the set of all values of B, but only compare
likelihoods. Taking now the M.B. distribution
as having been derived in §3, maximize


$$\log p(u/B) = -\log G(B) + \log S(u) - Bu$$


This requires that


$$u = -\frac{d \log G(B)}{dB}$$

B will be obtained by inverting this implicit
equation.
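
The inversion of this implicit equation can be illustrated numerically. The following is only a minimal sketch, assuming a hypothetical system with four discrete energy levels and made-up degeneracies S(u); the names `log_G`, `mean_energy` and `ml_estimate` are illustrative and do not come from the paper. Since -d log G(B)/dB is the mean energy under p(u/B), and this mean decreases as B grows, a plain bisection is enough.

```python
import math

def log_G(B, levels, S):
    """log of G(B) = sum_u S(u) exp(-B u) over the given energy levels."""
    return math.log(sum(s * math.exp(-B * u) for u, s in zip(levels, S)))

def mean_energy(B, levels, S):
    """-d log G / dB, computed exactly as the mean of u under p(u/B)."""
    weights = [s * math.exp(-B * u) for u, s in zip(levels, S)]
    return sum(u * w for u, w in zip(levels, weights)) / sum(weights)

def ml_estimate(u_obs, levels, S, lo=-50.0, hi=50.0, tol=1e-12):
    """Solve u_obs = -d log G(B)/dB for B by bisection.

    The mean energy is a decreasing function of B, so the root is unique
    whenever u_obs lies strictly between the smallest and largest level.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(mid, levels, S) > u_obs:
            lo = mid   # mean still too large: B must increase
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

levels = [0.0, 1.0, 2.0, 3.0]      # hypothetical energy levels
S = [1.0, 3.0, 5.0, 7.0]           # hypothetical degeneracies S(u)
B_hat = ml_estimate(1.5, levels, S)
```

With these assumed numbers, `B_hat` is the value of B at which the mean energy equals the observed u = 1.5; the log S(u) term plays no part in locating it, although S enters through G(B).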


One cannot fail to note the identity of this
result with a classical formula by Boltzmann15.
Let us review the proof of that result. To define
the temperature of a system of given u, one
assumes that only EU = u is known; then one derives
the "most likely" distribution of this energy
between the available (discrete) states. For that,
one maximizes the "entropy", or logarithm of the
likelihood, given EU = u and E1 = 1. The derivation,
using Lagrange multipliers, is too classical
to repeat. The reason for saying "likelihood"
instead of probability will appear soon. This
gives the distribution


$$p(\text{any state of energy } u/B) = K e^{-Bu}$$
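
For readers who want the classical step spelled out, here is a brief sketch. It assumes (the text does not say so explicitly) that the "entropy" is $-\sum_s p_s \log p_s$ over discrete states s of energy $u_s$; the multiplier $\lambda$ is notation introduced only for this sketch.

```latex
\mathcal{L} = -\sum_s p_s \log p_s
              - B\Big(\sum_s p_s u_s - u\Big)
              - \lambda\Big(\sum_s p_s - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial p_s}
  = -\log p_s - 1 - B u_s - \lambda = 0
\;\Longrightarrow\;
p_s = K e^{-B u_s},
\qquad
K = e^{-1-\lambda} = \Big(\sum_s e^{-B u_s}\Big)^{-1}.
```

The multiplier of the energy constraint is thus the parameter B, and the normalization constraint fixes K.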


as being the most likely (M.L.). This seems to
differ from the M.B. distribution, but in fact one
introduces the further assumption of "degeneracy":
that there may be S(u) different states of the same
energy. Then the most likely
distribution of energy turns out to be the MB pdf;
without further arbitrariness, one gets a relation
between B and EU, viz:


$$EU = -\frac{d \log G(B)}{dB}$$


Then one replaces herein EU by the known u,
and inverts to get B. Altogether Boltzmann's
argument amounts to an implicit and improvised
theory of estimation, anticipating Fisher's
theory. Note that the assumption of degeneracy
would not have changed the result of taking the
M.L. value of B, in Fisher's approach, since
the probability of any single state of energy u is


$$p(\text{any state of energy } u/B) = G^{-1}(B) \exp(-Bu),$$

and since the log S(u) term drops out in the maximization
of log p, relative to B.


Therefore, taking degeneracy for granted, the
only technical progress made comes from the
fact that maximum likelihood estimates have been
rather more carefully studied than most likely
states. It is quite true that the M.B. distribution
is now obtained without the arbitrariness of
the "maximum" specification (which needs
no amplification in thermostatistics, see
Darwin and Fowler and Khinchin33); but the
same arbitrariness is found immediately at
the next step.


But what about a conceptual benefit taken 
from the translation? Neyman7 had critically
shown, in detail, in what respects a likelihood 
differs from a classical probability as a limit 
of frequency; the same defect must be present 
somewhere in Boltzmann's likelihood, called 
a probability and distrusted as such, but never 
analysed as carefully. Of course the expression 


$$Q = \prod_s p_s^{f_s}$$


(where $f_s$ is the frequency of the state s, in

a finite population of N systems) is a probability, 
that can be multiplied for independent events, 
and added for disjoint events, and can be con- 
sidered as the product of probabilities of the 
individual events that each of the N systems be 
in the state it is in. True also, Shannon's
fundamental lemma of information theory (a
corollary of the weak law of large numbers)
asserts that, for large enough N, and except for
a set of events of total probability as small as
one wishes, the probability of all distributions of
energy is as close as one wishes to


$$\exp(-NH) = \prod_s p_s^{N p_s}$$

The number of "complexions" for which there
are $f_s$ systems in state s being $C = N!/\prod_s f_s!$,
the total probability of complexions characterized
by the $f_s$ is $QC$. It is seen to attain a maximum
if $f_s = N p_s$. But the probability of the most
probable partition is not $QC$, but $(\max_s p_s)^N$.
Besides, there is no sense in considering $\exp(-NH)$
as a product of N times the "probability" $\prod_s p_s^{p_s}$;
the latter expression, not involving asymptotics,
is not a probability of any event. At best, it could
be the "probability" of a probability distribution,
but formed in its own probability field! But
mostly, it ceases altogether to be a probability
when one compares different sets of $p_s$, and
becomes a likelihood, in Fisher's exact
sense. In fact, this likelihood seems to be even
more general than in the above application, 
since it is not limited to a parametric family of 
distributions; but a posteriori, Lagrange 
multipliers show that the most likely distribution 
is bound anyway to be one with a single parameter.
Note that the comparison of probabilities
of two distributions in their own fields is
made less absurd by the fact that $d\sum_s p_s \log p_s = \sum_s dp_s \log p_s$:
actually one chooses one of the two fields as
terrain of comparison. Note also another
dubious feature of the probability arguments
involved in the Boltzmann type of maximization,
which appears in quantum theory: one does 

not obtain probabilities, but probable values of 
the numbers of appearances of each alternative. 
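
The quantities Q, C and exp(-NH) discussed above can be checked on a small case. The sketch below assumes three states with made-up probabilities and N = 20 systems; the function names are illustrative only. It enumerates every set of counts, finds the one maximizing QC, and compares Q at $f_s = N p_s$ with exp(-NH).

```python
import math

def log_Q(f, p):
    """log of Q = prod_s p_s**f_s, the probability of one given complexion."""
    return sum(fs * math.log(ps) for fs, ps in zip(f, p) if fs > 0)

def log_C(f):
    """log of C = N! / prod_s f_s!, the number of complexions with counts f."""
    N = sum(f)
    return math.lgamma(N + 1) - sum(math.lgamma(fs + 1) for fs in f)

p = [0.5, 0.3, 0.2]   # hypothetical state probabilities
N = 20                # hypothetical number of systems

# Enumerate every set of counts (f_0, f_1, f_2) adding to N; keep the one maximizing QC.
best = max(
    ((f0, f1, N - f0 - f1) for f0 in range(N + 1) for f1 in range(N + 1 - f0)),
    key=lambda f: log_Q(f, p) + log_C(f),
)

H = -sum(ps * math.log(ps) for ps in p)   # entropy per system, in nats
f_typical = [N * ps for ps in p]          # the counts f_s = N p_s
```

In this small case the maximizing counts coincide with $N p_s$, and log Q evaluated there equals -NH, while the single most probable complexion (every system in the most probable state) has the larger log-probability N log max p.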


A final defect of maximization derivations of the
MB pdf, by contrast to the sufficiency derivation,
is that successive applications of Stirling's formula
and passages to the limit hide the exact domain of
applicability of the result obtained.


5.2 Second method of estimation of B: the Bayes
method. Choice of S(u). Minimum risk.


Bayes estimates. The oldest method of estimation,
against which the maximum likelihood
method was initially set up, was Bayes's method,
based upon the assumption that (on the basis of
reasons for believing that some values of a parameter
are more "frequent" than others) one may
set up prior probabilities for B, which thus
becomes a random variable. Then one may
compute posterior probabilities, and estimate
B to be the mean or a posteriori most probable
value. Note that one may multiply the prior
probabilities by any constant, without changing
the posterior mean or most probable value.
Sometimes this is used to justify using "weights"
for the different prior states, which do not add to
any finite number: the axiomatic basis for such
conditional probabilities was given by Rényi34.
Recall that Boltzmann's method did not involve
any prior probabilities: it is another reason for
considering improper its specification as a
theory of the "most probable".
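
The remark that prior "weights" may be multiplied by any constant can be seen in a small numerical sketch. It assumes, purely for illustration, the likelihood p(u/B) = B exp(-Bu) (the M.B. form with S(u) = 1 on u >= 0, so that G(B) = 1/B), a grid of candidate values of B, and uniform weights; none of these choices comes from the paper.

```python
import math

def posterior_mean(u_obs, B_grid, prior_weights):
    """Posterior mean of B on a grid, for the likelihood p(u/B) = B exp(-B u).

    The prior weights need not sum to one: they are normalized implicitly
    when dividing by the total posterior weight.
    """
    post = [w * B * math.exp(-B * u_obs) for B, w in zip(B_grid, prior_weights)]
    total = sum(post)
    return sum(B * w for B, w in zip(B_grid, post)) / total

B_grid = [0.1 * k for k in range(1, 101)]   # candidate values of B from 0.1 to 10.0
uniform = [1.0] * len(B_grid)               # "weights" that do not sum to one
scaled = [37.0 * w for w in uniform]        # the same weights times a constant

m1 = posterior_mean(2.0, B_grid, uniform)
m2 = posterior_mean(2.0, B_grid, scaled)
```

Both calls return the same posterior mean, since the constant cancels between numerator and denominator.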


The belief in the necessity of prior probabilities
in estimation (in "going from effects to causes")
is very widespread in physical papers. This was
von Neumann's point of departure in his 1932
axiomatization of the quantum theory; he also
notes that, since there are different possible
prior probabilities, the problem does not have
a unique solution; but he finds that a certain prior
distribution is "distinctly different from others",
and bases the whole theory on it. Gibbs, on the
contrary, warned of the impossibility of determining
probabilities for prior events on the basis
of posterior ones. Others, such as Fowler, widely use
the formalism of weights of the Bayes
theory, but insist that these have nothing to do
with the time a system spends in a state.


In general, Bayes estimation is further based
upon uniform prior probabilities; this is supposed
to result from a principle of "insufficient reason".
There seems therefore to be a trace of the Bayes
approach to physics any time that cells are
assumed equiprobable. Consider as an example
the method of estimation of B which is implicit
in Darwin and Fowler's method of mean values.
There, one takes the mean value of, say, energy,
assuming all partitions of energy to be equiprobable.
Then, one finds that a parameter B introduces
itself through a mathematical trick, and
altogether one estimates this parameter as
being the one for which the mean value of energy
is the observed value.




Choice of S(u). There is no need to give here
a list of modern methods of statistics, free
from both prior probabilities and likelihoods,
since we are rather reviewing the conventional
thermostatistics, in the light of the kind of
statistics it involves. One should note however
that Bayes's thinking, in the form used by von
Neumann, is also present whenever one
justifies the MB distribution as being invariant
under shocks of a certain class. This is so
because, from the viewpoint of the kinetic
theory, the energy is no random variable, as
we have stressed in §4, and making it such is
as open to criticism as making a parameter into a
random variable: to stress the analogy, imagine
that the "probability a priori" is that of a
distribution of energy at an instant of time,
while the "probabilities proper" are the
transition probabilities between different
distributions. Therefore, whatever the progress
of statistics, the critique of Bayes's approach,
at a given time, will be required in thermostatistics.
Since our approach is not kinetic,
this is not a problem here, but in fact Bayes
thinking may also be used to determine
(or rather "choose") the structure function
S(u). The choice of this function by the observer,
who knows u, is a basic aspect of the arbitrariness
of estimation of B, which we have not
stressed enough so far, since in principle it
does not belong to estimation, but to the construction
of physical models. The present point
is that such models are often based upon
"insufficient reason" arguments, akin to those
used in Bayes estimation. In fact, the whole
theory of degeneracy is based upon it, since to 
identify the structure function S(u), with a 
number of states of given energy, (a number 
that can be calculated from geometric and com- 
binatorial considerations), one may postulate 
that when the energy available, and therefore 1/B,
tend to infinity, the ratio of the probabilities of
any two states must tend to one. This is close
to "insufficient reason"; the list of states
changes when the "reasons" change (quantum
statistics).


A remark on sufficiency and smallest risk 


Consider now all those criteria of estimation
which aim to minimize a certain average (Bayes)
"risk", linked with occasional wrong estimates.
Under those conditions Bohnenblust, Shapley
and Sherman have defined an experiment "a" as
being not less informative than an experiment "b" if
any risk attainable with experiment b could also
be attained with experiment a. Blackwell and
Girshick35 have used this criterion to compare
the measurement of the u_i of a set of parts,
and of the total u alone, when u is a sufficient
statistic. They have shown that if the number of
alternatives is finite, the measurement of u is
not less informative than that of the u_i's. The
result is most presumably still true for a
continuous infinity of alternatives, under some
regularity conditions, which may be the same as
those leading to the M.B. canonical distribution.


E. Lehmann has considered tests of hypotheses
such as "B is in a certain interval".
He shows that when a sufficient statistic exists,
any test based on a function of the measures may
be replaced by a test based on the sufficient
statistic, without increasing the power function,
that is, the probability of wrongly accepting the
hypothesis.


Also mention the following phenomenon
(for definitions and proof, see Kendall36 [II 281]):
"If a system of uniformly most powerful tests
exists, and if any point in the sample space lies
on the boundary of a best critical region, then a
sufficient estimator exists for the parameter
whose variation provides the admissible alternatives."


§6. The arbitrariness in estimation 
vanishes asymptotically. 


6.1. Passage to the limit of very large systems, 
i.e, very large log ¢ 


Since there is no motivation a priori, in
classical probability theory, for the Boltzmann-Fisher
estimates of "rational belief", these
estimates should be judged by the properties they
turn out to have. In fact, they have extremely
satisfactory asymptotic properties.


In thermostatistics, they converge to the Darwin-Fowler
estimates when the size of the system
tends to infinity. This was proved by Darwin
and Fowler by the method of steepest descents;
later by Khinchin, by the method of local limit
theorems of probability. Therefore, since the
physicist is not interested in small bodies, the
Boltzmann procedure is justified by its simplicity,
and by its convergence to theories considered as
"quite correct".


In statistics, they behave well for large
samples. Chiefly, they are consistent (they
converge to the true value, with probability
approaching unity, as the size tends to infinity);
they are normally distributed around the true
value, and their variance is the smallest possible
(efficiency). Given the fundamental arbitrariness
of estimation criteria, one could not say that any
other procedure is more correct than this one; so,
whenever it is the simplest (for example
when a sufficient statistic exists), and since it is
at least as good as any other, one can use this
procedure for large samples.
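
Consistency can be illustrated by simulation. The sketch below assumes the simplest M.B. case, p(u/B) = B exp(-Bu) on u >= 0 (that is, S(u) = 1 and G(B) = 1/B), for which the likelihood equation reduces to mean(u) = 1/B; the "true" value and the sample sizes are arbitrary choices made for the illustration.

```python
import random

def ml_B(sample):
    """Maximum likelihood estimate of B for p(u/B) = B exp(-B u), u >= 0.

    Here -d log G / dB = 1/B, so the likelihood equation reads mean(u) = 1/B.
    """
    return len(sample) / sum(sample)

rng = random.Random(0)          # fixed seed so the run is reproducible
B_true = 2.0                    # hypothetical "true" value of B
errors = {}
for n in (10, 1000, 100000):
    sample = [rng.expovariate(B_true) for _ in range(n)]
    errors[n] = abs(ml_B(sample) - B_true)
```

For the largest sample the estimate falls very close to the assumed true value, which is the behaviour the asymptotic theory predicts; nothing is guaranteed for the smallest one.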




6.2. Small systems 


As their main drawback, from the viewpoint
of present problems of statistics, maximum
likelihood estimates lack reasonable small-sample
properties; and among the class of
consistent, asymptotically normal, efficient
estimates, they may not be the ones that converge
fastest. The statistician can no longer
justify the definition of an estimation
procedure by its asymptotic properties.
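
A small-sample defect of this kind can be made concrete in the same assumed exponential model, p(u/B) = B exp(-Bu): there the maximum likelihood estimate n/(u_1 + ... + u_n) has mean nB/(n-1), a standard closed form for this model, so it overshoots B appreciably when n is small. The following sketch checks this by simulation; the sample size, number of trials and seed are arbitrary.

```python
import random

def mean_ml_estimate(n, B_true, trials, seed=1):
    """Average of the ML estimate n / sum(u) over many samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = sum(rng.expovariate(B_true) for _ in range(n))
        total += n / s
    return total / trials

B_true = 2.0                    # hypothetical "true" value of B
avg = mean_ml_estimate(n=5, B_true=B_true, trials=200000)
exact = B_true * 5 / (5 - 1)    # closed form n B / (n - 1) for this model
```

With n = 5 the simulated average lies near 2.5 rather than 2.0, a bias of 25 per cent that vanishes only as n grows.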


However, the physicist is not troubled by 
this, and continues to take the best from the 
facilities, permitted by the "fortunate insensi- 
bility of thermodynamic functions", stressed by 
Lorentz. 


6.3 Possibility of constructing observation
models of thermostatistics.


This fortunate insensibility is the very key
to the possibility of constructing models in
physics which are based upon conditions of
optimality of observation, despite the logical
arbitrariness of what is "the most favourable".
In the general case, this is impossible; the
criterion of an engineering problem can be
changed at will, without its ceasing to be
another acceptable engineering problem. But
the state of nature most favourable for one
problem may not be so for any other, so that
the fact of being most favourable is not in
general a way of describing a real physical
situation. One may say that each time one
describes (and can fully characterize) outside
circumstances as being the most favourable
for somebody, one builds a new anthropocentric
physics (this is also the case when
one assumes that the opponent in a zero-sum
two-person game plays a minimax strategy
against you, choosing your matrix of imputations).
To invert methodologically an engineering
problem is to identify this physics with an
actual one. But how to defend the model so
obtained, if the criterion of the engineering
problem is irreducibly arbitrary?


The whole point is that arbitrariness exists
anyway in thermostatistics, and that it is exactly
the same as in statistics. This is in no way
surprising, in view of the way in which probabilities
were assigned to hypothetical ensembles
in thermostatistics3 to "agree with our partial
knowledge" (a fact linking irreversibility not to
incompleteness or inexactness of mechanics, but
to incompleteness of specification). Criteria
of equipartition are only "reasonable" criteria,
never considered as anything but mathematical
abstractions of reality, or in other words as
"fascinating" mathematical tricks, but still
tricks; and they may become very dangerous
conceptually when "asymptotically slight"
differences go undetected altogether, and
entirely different concepts get exchanged
because they lead to the same numerical results.
Entropy is the worst sinner in that respect, and
it will be studied in that light in §7 (second part
of this paper).


It is a pity, of course, that the phenomenological
and statistical approach can only be franker about
doubtful points, and can help circumscribe and better
understand them, but cannot help eliminate them.
Statistics may be sorry for it too, because it makes
hopeless any belief in an improvement of the
choice among statistical criteria on the basis of
their appropriateness for physics.


Conclusion to part I 

In part II of the paper, we study the problem
of the informations and entropies; in part III, the
problem of the possibility of interpolative subdivision
of physical systems; and in part IV, the
generalization of the whole to systems with exchange
of energy, to quantal systems, and to non-parametric
systems. However, it can already be
seen how thermostatistics has followed a development
parallel to that of mathematical statistics. Recall
how late (Khinchin33) the parallel developments of
probability theory and of thermostatistics have
converged; statistics and thermostatistics have
been even slower, although statistics and probability
have long converged, and could have performed
the link.


In the comparison, statistics appears as by
far the more advanced discipline. Thus mathematical
and methodological help turns out to go
the "wrong" way, up the scale of sciences of
Auguste Comte, since modern small-sample
statistics was made necessary in biology, agriculture,
social science and industrial acceptance
(where asymptotics are of no help, and the process
of observation is of so great importance that
there is no difficulty in accepting theories centered
around the observer). Statistics became necessary for
the physicists only when the problem of noise
reduction had to be faced: this is the historical
reason for the context of this research.


It is in a sense very discouraging to have found
on this occasion that the wish for conceptual coherence
of the theoretical physicist had been
weaker than the practical needs of "automatisation
of thought processes in inference" ("cybernetics"?).
The moral is that the foundations of thermostatistics
should be reviewed critically after each
important progress in the mathematics or methodology
of other statistical disciplines.


Acknowledgements. The author wishes to thank the 
Rockefeller Foundation and the Centre National de la 
Recherche Scientifique (Paris), for research grants, 
during parts of the time while this paper was written;
and Miss Knight, for secretarial help. 




References of Part I


1 B. Mandelbrot : Simple games of strategy,
occurring in communication through natural
languages; Trans. I.R.E. Inf. Theory 3
(1954) 124

2 B. Mandelbrot : Statistical structure of
language; Communication theory; Academic
Press (1953) 486

3 B. Mandelbrot : Jeux de Communication; Publ.
Inst. Stat. Paris 2 (1953) 3

4 B. Mandelbrot : Structure formelle des textes;
Word (New York) 10 (1954) 1

5 B. Mandelbrot : Thermostatistics of Willis
systems; Information Theory; Academic
Press (1956)

6 C.E. Shannon : Mathematical theory of communication;
Bell S.T.J. 1948

7 J. Neyman : Lectures and conferences on
mathematical statistics and probability;
Department of Agriculture (1952), chap. 4

8 Leo Szilard : Über die Ausdehnung der phänomenologischen
Thermodynamik auf die
Schwankungserscheinungen; Z. Physik 32
(1925) 753

9 J.L. Lawson, G.E. Uhlenbeck : Threshold
signals; McGraw-Hill (1950)

10 A. Kolmogoroff : Bull Ac Sc URSS Math 5
(1941) 3

11 N. Wiener : Time series; Wiley (1949)

12 P.M. Woodward : Theory of observation; TRE
Journal (1949)

13 J. von Neumann : Mathematical principles of
quantum mechanics, Princeton (1954)

14 R.A. Fisher : Contributions to mathematical
statistics, Wiley (1950)

15 A. Sommerfeld : Thermodynamics and statistical
mechanics; Academic Press (1956)

16 G. Darmois : XXII Session de l'Inst. Int. Stat.
Athènes (1936)

17 B.O. Koopman : Trans. Am. Math. Soc. 39
(1936) 399

18 E.J.G. Pitman : Proc. Camb. Philo. Soc. 32
(1936) 567

19 G.N. Lewis : J. Am Chem Soc. 53 (1931) 25

20 R. Fowler, E.A. Guggenheim : Statistical
thermodynamics; Cambridge (1952)

21 E.A. Guggenheim : Thermodynamics; North
Holland (1949)

22 de Groot : Thermodynamics of irreversible
processes; North Holland

23 L. Onsager, S. Machlup : Phys. Rev. 91
(1953) 1505

24 M.A. Girshick, L.J. Savage; Second Berkeley
Symposium Stat. (1951) 53

25 P. T. Ehrenfest; Encyclopedie Sc. Math. IV 1, 6
(1915) 188

26 G.E. Uhlenbeck; Lectures at Princeton, Les
Houches, etc... (1955)

27 J.W. Gibbs : Statistical Mechanics; Yale (1903)

28 M. Kac : Third Berkeley Symposium Stat. III (1956)

29 L. van Hove : Physica 21 (1955) 517

30 D. Bohm : Phys. Rev. 85 (1952) 166

31 L. de Broglie : Comptes-rendus since 1952

32 J.L. Vigier : Interprétation causale de la mécanique
quantique; Gauthier Villars (1956)

33 A.I. Khinchin : Statistical Mechanics; Dover (1949)

34 A. Rényi : Acta Math. Hungarica (1955)

35 D. Blackwell, M.A. Girshick : Theory of Games and
Statistical Decisions, Wiley (1954)

36 M. Kendall : Theory of Statistics; Griffin (1948)

37 C.R. Rao : Advanced Statistical methods; Wiley (1952)

Appendices 


A.1. (after >", § 40.5) 


Let us assume that uj are independent observations 
from a population, with range independent of B. 

It can be shown that a general necessary and 
sufficient condition, that a distribution admits a 
sufficient statistic is the Fisher-Neyman relation : 


$$\prod_i p(u_i/B) = P(Q(u_1, u_2, \ldots, u_N), B)\, R(u_1, u_2, \ldots, u_N)$$


where P (Q, B) is the density of the statistic Q, and 
R, is independent of B. 


Assuming all functions partially differentiable with 
respect to B, write : 


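$$\sum_i \frac{\partial \log p(u_i/B)}{\partial B} = \frac{\partial \log P(Q, B)}{\partial B} = W(Q, B)$$


Since this holds for all B, any value of B can be
substituted, to obtain the relation


$$v = \sum_i v(u_i) = V(Q)$$


connecting Q and the statistic $v = \sum_i v(u_i)$. If V(Q)
and v(u) are differentiable functions, it follows that


$$\frac{\partial v}{\partial u_i} = \frac{dv(u_i)}{du_i} = \frac{dV(Q)}{dQ}\,\frac{\partial Q}{\partial u_i}$$


Also,


$$\frac{\partial^2 \log p(u_i/B)}{\partial B\,\partial u_i} = \frac{\partial W(Q, B)}{\partial Q}\,\frac{\partial Q}{\partial u_i}$$


Therefore, for all i,


$$\frac{\partial^2 \log p(u_i/B)}{\partial B\,\partial u_i} \Big/ \frac{dv(u_i)}{du_i} = \frac{\partial W(Q, B)}{\partial Q} \Big/ \frac{dV(Q)}{dQ} = L(B)$$


a function of B only. Integrating with respect to Q,

$$W(Q, B) = L(B)\,V(Q) + M(B)$$

and then with respect to B,


$$\sum_i \log p(u_i/B) = -X(B)\,V(Q(u_1, \ldots, u_N)) + Y(B) + A(u_1, \ldots, u_N)$$

$$p(u/B) = G^{-1}(B)\,S(v(u))\,\exp(-X(B)\,v(u))$$


Choose the scales so that X(B) be B, and v(u) be u:
v is clearly additive.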


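A.2 (after Rao37, § 4a.2)


Let $\prod_i p(u_i/B)$ be the probability density of the
observations and $V(u_1, \ldots, u_N)$ be an unbiased estimate
of f(B), a function of the parameter B of the density.
Then:

$$\int \cdots \int V(u_1, \ldots, u_N) \prod_i p(u_i/B)\, du_1 \cdots du_N = f(B)$$


If the limits of integration do not involve B,
differentiation under the integral yields:


$$\int \cdots \int V(u_1, \ldots, u_N)\, \frac{\partial}{\partial B} \prod_i p(u_i/B)\, du_1 \cdots du_N = \frac{df(B)}{dB}$$

By Schwarz's inequality:

$$E\big[(V - f(B))^2\big] \times E\Big[\Big(\frac{\partial \log \prod_i p(u_i/B)}{\partial B}\Big)^2\Big] \geq \Big(\frac{df(B)}{dB}\Big)^2$$


Then taking I = 1, and writing D(V) for the first factor and F for the second,
Fisher's information inequality follows:


$$D(V) \geq \Big(\frac{df}{dB}\Big)^2 \Big/ F$$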


A.3 (after Rao37, § 4a.3)


The only equality case of Schwarz's inequality is
when:


$$V(u_1, \ldots, u_N) - f(B) = L(B)\, \frac{\partial \log \prod_i p(u_i/B)}{\partial B}$$


where L is a function of B only. This is a differential
equation, having the solution


$$\log \prod_i p(u_i/B) = -V X(B) + Y(B) + A$$


where X(B), Y(B) are functions of B, and A is
independent of B. Hence

$$p(u/B) = G^{-1}(B)\, S(V(u)) \exp(-V(u) X(B))$$


It is easily seen that reciprocally, the lower limit to
variance is attained by this distribution.
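
The reciprocal statement can be checked numerically on a small case. The sketch below assumes a hypothetical system with discrete energies u = 0, 1, ..., 59 and S(u) = 1, takes V = u as the estimate of f(B) = EU, evaluates df/dB by a central finite difference, and compares the resulting lower bound with the variance of u. All names and numbers are illustrative.

```python
import math

LEVELS = range(0, 60)   # hypothetical discrete energies u = 0, 1, ..., 59

def moments(B):
    """Mean and variance of u under p(u/B) = exp(-B u) / G(B) on LEVELS."""
    w = [math.exp(-B * u) for u in LEVELS]
    G = sum(w)
    mean = sum(u * wi for u, wi in zip(LEVELS, w)) / G
    var = sum((u - mean) ** 2 * wi for u, wi in zip(LEVELS, w)) / G
    return mean, var

def fisher_information(B):
    """F = E[(d log p / dB)^2], using d log p / dB = -u - d log G / dB = mean - u."""
    mean, _ = moments(B)
    w = [math.exp(-B * u) for u in LEVELS]
    G = sum(w)
    return sum((mean - u) ** 2 * wi for u, wi in zip(LEVELS, w)) / G

B, h = 0.7, 1e-5
f_prime = (moments(B + h)[0] - moments(B - h)[0]) / (2 * h)   # df/dB with f(B) = E U
bound = f_prime ** 2 / fisher_information(B)
variance = moments(B)[1]
```

Up to the finite-difference error, `bound` and `variance` agree, which is the equality case of the inequality for a distribution of the exponential form derived above.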


A RADAR DETECTION PHILOSOPHY 


William McC. Siebert 


Research Laboratory of Electronics and Lincoln Laboratory 
Massachusetts Institute of Technology, Cambridge, Mass. 


Abstract 


This paper is an attempt to present a short, 
unified discussion of the radar detection, para- 
meter estimation, and multiple-signal resolution 
problems--mostly from a philosophical rather than 
a detailed mathematical point of view. The pur- 
pose is, hopefully, to make it possible in at 
least some limited sense to reason back from ap- 
propriate measures of desired radar performance to 
specifications of the necessary values of the related
radar parameters. Specifically, four measures
of performance quality are considered:


1. The reliability of detection,
2. The accuracy with which target parameters
   can be estimated,
3. The extent to which such estimates can
   be made without ambiguity,
4. The degree to which two or more different
   target echoes can be separated
   or resolved.


It is argued that the radar synthesis problem
can be split into two more-or-less independent
phases. First, adjust such parameters as
those appearing in the radar equation so that the
received signal energy is sufficiently large for
the degree of reliability of detection desired.
The required value of energy is almost entirely
independent of the character of the received echo
signal waveform. The second phase is, then, to
select the waveform in such a way that accuracy,
ambiguity, and resolution requirements are met. 
The limitations on what can be achieved in terms 
of these three quality measures are discussed in 
relation to an uncertainty principle. For pur- 
poses of illustration several novel waveforms 
having unusual and useful properties are des- 
cribed. 


Most radar design engineers today are acquainted
with at least the rudiments of statistical
methods and probability concepts. They have
studied aspects of detection theory and mastered
the operational methods of signal and system
analysis. They speak knowingly of "matched filters"
and "uncertainty principles." But often
this knowledge is fragmentary--quite useful for
radar analysis, but fundamentally inadequate for
radar synthesis. What is often missing is a sense
of perspective, an appreciation of the relative
importance of and the interconnections between
isolated bits of knowledge--in a phrase, a radar
detection philosophy. The rather ambitious purpose
of this paper is to attempt to state such a philosophy--or
better a part of such a philosophy,
since we shall ignore many aspects of the radar
detection problem. Specifically we shall not
even mention many practical questions such as
implementation, effect of system instabilities,
approximations and distortions, countermeasures,
etc. Moreover, we shall, for various reasons to
be discussed, consider only "search" type radar
applications.


* The research in this document was supported
jointly by the Army, Navy, and Air Force under
contract with the Massachusetts Institute of
Technology.


Our intention, thus, is to discuss a theory. 
Now, the ultimate purpose of any theory in ap- 
plied science is always to achieve some type of 
synthesis, i.e., to make it possible to reason 
back from effects to causes, or from desired per- 
formance to system parameters. To be successful, 
then, our theory must meet three conditions: 


1. The model on which the theory is based
must at least approximately represent
the actual physical situation;

2. The theory must yield a fundamental,
complete, and consistent set of parameters
and concepts in terms of which
both the desired performance and the
radar system can be uniquely specified;

3. The theory must include all upper bounds,
limiting relationships, or realizability
conditions which prevent the simultaneous
achievement of an arbitrary set of
parameters.


These three points constitute a rough out- 
line of this paper. More specifically we shall 
first postulate a model of both the target situa- 
tion and the radar. We shall then consider two 
very restricted special cases corresponding to a 
single target with discrete parameter distributions.
Despite the restrictions, a discussion of
these cases will lead to quite general statements
about those parameters and limiting conditions
which relate to the general question of reliability
of detection. Next we shall pass to the case
of continuous parameter distributions and consider
the questions of accuracy and ambiguity
and their relation to what might be called the
Radar Uncertainty Principle. This section will be
illustrated by a number of examples, some of which
are rather novel. Finally we shall consider briefly
and qualitatively the case of multiple targets
and the question of resolution.


1.0 The Radar Model


We shall begin with the postulation of a 
model. For detection purposes we are principally
interested in the received waveform and the way
in which it is related to the parameters of the
targets and the transmitted waveform. We can
avoid a number of trivial steps in the argument
if we choose our postulates so as to define di- 
rectly the received waveform. Hence, we shall 
assume that: 


1. The volume examined by the radar contains
a number of point scatterers whose
individual properties can be completely
described by

a. the amplitude of the return from
each,
b. the range of the scatterer or delay
in the echo,
c. the velocity of the scatterer or
Doppler shift in the echo;

2. The effective duration of the echo from
each scatterer is limited, known, and independent
of the properties of the scatterer;

3. The amplitude of the echo and the velocity
of the scatterer are constants, at
least during the effective echo duration.


A fourth assumption is required to specify
the actual shape of the received waveform, but
since this is primarily a matter of nomenclature
and requires some development we shall postpone
it for the moment. It is easy to raise questions
about the necessity, rationality, and implications
of the assumptions listed above. Nevertheless,
we believe that they are the simplest set of constraints
which preserve, at least in some rudimentary
form, the major aspects of the radar problem.
Moreover, they are the most common assumptions,
implied if not expressed, in most discussions
of the radar problem, and they have in
general the pragmatic justification of leading to
mathematical and philosophical problems which are,
at least in principle, amenable to solution. Of
course, any set of assumptions of this type limits
the applicability of the theory to follow. In
this respect the second assumption is perhaps the
most damaging since it effectively limits the
theory to "search" as opposed to "track" applications.
Actually there is a rather important
point of philosophy involved here about which we
shall have more to say later on. Moreover, an
additional implication of this second postulate
is that we must effectively assume either that the
antenna is step-scanned or that the angular coordinates
of the target are known a priori. In
general we plan to ignore the problem of measuring
the angular coordinates of the target. There is
nothing particularly fundamental about this and
the theory can be modified to include angular
measurements with, of course, an increase in complexity.
The first and third assumptions are,
perhaps, less controversial, and essentially
amount to requiring that the observation interval
or duration of the echo must be sufficiently short
so that acceleration and rapid scintillation effects
can be ignored.


Returning now to the fourth assumption, we
desire to introduce some symbolic notation for the
received waveform and its relation to the transmitted
waveform and the target parameters. The
first three assumptions permit us to represent the
echo signal received at any moment in terms of a
canonic signal

$$ s(t) = \mathrm{Re}\left[ S(t)\, e^{j\omega_c t} \right] \tag{1-1} $$

where

$$ S(t) = |S(t)|\, e^{j\psi(t)}, \qquad \omega_c = \text{carrier frequency}, $$

so that

$$ s(t) = |S(t)| \cos\left[ \omega_c t + \psi(t) \right] \tag{1-1a} $$

The additional assumption will be made that $|S(t)|$
and $\psi(t)$ vary slowly compared with $\omega_c t$ so that
the usual narrow-band assumptions will be justified.
For example, it is convenient to normalize
the energy level of $s(t)$ by specifying that

$$ \int_{-\infty}^{\infty} s^2(t)\, dt = \frac{1}{2} \int_{-\infty}^{\infty} |S(t)|^2\, dt = 1 \tag{1-2} $$

Physically, $s(t)$ will be interpreted as the return
from a "unit" fixed target at zero range.
Consequently, except for the effect of antenna
scanning or equivalent means of limiting the echo
duration, $|S(t)|$ and $\psi(t)$ can be considered as
the envelope and phase variation of the transmitted
waveform, and hence, along with $\omega_c$, are assumed
known to the receiver. Thus, using the
narrow-band assumption, the echo from a target
having a range delay $\tau$ and a Doppler frequency
shift $\omega$ rad/sec, $\omega \ll \omega_c$, will be represented as

$$ A\, s_\omega(t-\tau) = \mathrm{Re}\left[ A\, S(t-\tau)\, e^{j(\omega_c+\omega)(t-\tau)} \right] \tag{1-3} $$

so that

$$ A\, s_\omega(t-\tau) = A\, |S(t-\tau)| \cos\left[ (\omega_c+\omega)(t-\tau) + \psi(t-\tau) \right] \tag{1-3a} $$


1. Complex notation is employed for simplicity
at a later stage and should not be considered in
any sense fundamental or mysterious.


2. Although the limits of integration in (1-2),
as in other similar integrals to follow, are
given as $-\infty$ and $\infty$, it should be remembered that,
in accordance with the second basic assumption,
$s(t)$ is assumed to be zero outside some relatively
short interval of time.




and

$$ \int_{-\infty}^{\infty} A^2 s_\omega^2(t-\tau)\, dt = A^2 = \text{echo signal energy.} \tag{1-3b} $$


Finally our fourth assumption is that the 
total received signal can be represented as 


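$$ r(t) = n(t) + \sum_i A_i\, s_{\omega_i}(t-\tau_i) = n(t) + \mathrm{Re}\left[ \sum_i A_i\, S(t-\tau_i)\, e^{j(\omega_c+\omega_i)(t-\tau_i)} \right] \tag{1-4} $$

where $n(t)$ represents essentially white Gaussian
noise of known power density, $N_0$ watts/cps
(double-sided spectrum).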
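
The model (1-3a), (1-4) can be illustrated numerically. The following is a minimal sketch, not part of the original paper: the Gaussian envelope, the sampling step, the carrier frequency, and all function names are hypothetical choices made only for illustration, and $\psi(t)$ is taken as zero.

```python
import numpy as np

def envelope(t, width=1.0):
    """Hypothetical Gaussian |S(t)|, scaled so that (1/2) * integral |S|^2 dt = 1."""
    dt = t[1] - t[0]
    env = np.exp(-0.5 * (t / width) ** 2)
    return env * np.sqrt(2.0 / (np.sum(env ** 2) * dt))

def echo(t, A, tau, omega, omega_c, width=1.0):
    """A * s_omega(t - tau) in the form (1-3a), taking psi(t) = 0."""
    return A * envelope(t - tau, width) * np.cos((omega_c + omega) * (t - tau))

def received(t, targets, omega_c, N0, rng, width=1.0):
    """r(t) = n(t) + sum of echoes, as in (1-4)."""
    dt = t[1] - t[0]
    # Sampled white noise of double-sided density N0 has variance N0 / dt per sample.
    r = rng.normal(0.0, np.sqrt(N0 / dt), size=t.shape)
    for A, tau, omega in targets:
        r = r + echo(t, A, tau, omega, omega_c, width)
    return r

t = np.arange(-20.0, 20.0, 0.001)
omega_c = 2 * np.pi * 20.0                      # assumed carrier, rad/sec
rng = np.random.default_rng(0)
r = received(t, [(3.0, 1.5, 0.0), (2.0, -4.0, 5.0)], omega_c, N0=1.0, rng=rng)

# Energy of a single echo with A = 3; by (1-3b) this should be close to A^2.
single = echo(t, 3.0, 1.5, 0.0, omega_c)
echo_energy = np.sum(single ** 2) * (t[1] - t[0])
```

The per-sample noise variance $N_0/dt$ is chosen so that the correlation of the noise with a unit-energy waveform has variance $N_0$, which is the property used in the later sections.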


2.0 The A Posteriori Probability [1]


Given $r(t)$, our problem roughly is to determine
the number and parameters of the targets
which are present. Of course we cannot expect
to do this with absolute certainty because of the
noise, if for no other reason. The best that we
can hope to achieve is to be able to attach a
probability to the truth of any proposition made
about the target situation. Specifically we shall
assert that a specification, for each possible set
of values of $A$, $\tau$, and $\omega$, of the probability
that there is a target present with those parameters
constitutes a complete statement of our
knowledge about the target situation obtained
from $r(t)$. Considered as a function of $A$, $\tau$,
and $\omega$, this is called the a posteriori (i.e.,
after reception of $r(t)$) probability distribution,
which we shall denote by $P(A,\tau,\omega)$.


However, as soon as we try to compute $P(A,\tau,\omega)$
for some particular $r(t)$, a difficulty appears.
This is, of course, the fact that $P(A,\tau,\omega)$ for
each particular $A$, $\tau$, and $\omega$ depends on our expectation
prior to receiving $r(t)$ that a target
with that particular set of parameters would be
present, i.e., $P(A,\tau,\omega)$ depends on the a priori
probability distribution $P_0(A,\tau,\omega)$. Other things
being equal, the more likely a target is before
receiving $r(t)$, the more likely it is afterwards.
Specifically it can be shown that if $n(t)$
is white Gaussian noise and if we know, for
example, that there is at most one target present,
then the a posteriori probability that a target
is present and has the particular parameters $A$, $\tau$,
and $\omega$ is


1. Contrary to the more usual practice we shall
employ throughout a double-sided spectrum, i.e.,
including both positive and negative frequencies.
Hence, for example, the noise power output of a
filter with frequency response $H(\omega)$ will be

$$ \frac{N_0}{2\pi} \int_{-\infty}^{\infty} |H(\omega)|^2\, d\omega . $$

For a receiver with noise figure $F$, $N_0$ is
given by $\tfrac{1}{2} F k T$, $k$ = Boltzmann's constant,
$T$ = absolute temperature.

[1]. Superscripts in brackets refer to the 
bibliography. 


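$$ P(A,\tau,\omega) = k\, P_0(A,\tau,\omega)\, e^{-A^2/2N_0} \exp\left[ \frac{A}{N_0} \int_{-\infty}^{\infty} r(t)\, s_\omega(t-\tau)\, dt \right] \tag{2-1} $$

where $k$ = normalization constant, independent of
$A$, $\tau$, and $\omega$, so chosen that the sum or integral
of $P(A,\tau,\omega)$ over all possible values (including
$A = 0$) of the parameters is 1.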


The difficulty, of course, is that in many
cases the radar designer--and indeed the radar
user--has little or no knowledge about $P_0(A,\tau,\omega)$.
To quote an example of Woodward's, what after all
is the "a priori probability of observing an aircraft
on a given radar set at a range of ten miles
at nine o'clock tomorrow morning?" It is not
even clear that this question has any meaning in
the sense of mathematical as opposed to subjective
probability, since an ensemble of like situations
is hard to imagine.


It has been argued with some justification
that, from the point of view of radar design
at least, the dependence of $P(A,\tau,\omega)$ on $P_0(A,\tau,\omega)$
is not very important. The argument essentially
depends upon two observations:

1. The only way in which the particular
signal received adds to our knowledge of
any attribute of the situation is through
the integral

$$ \Lambda = \int_{-\infty}^{\infty} r(t)\, s_\omega(t-\tau)\, dt \tag{2-2} $$

Thus any receiver which computes the integral
(2-2) for all possible values of $\tau$
and $\omega$ is preserving all of that information
in $r(t)$ relevant to any decisions
about the presence or parameters of an echo
signal. Furthermore, the output of this
receiver is just sufficient in that any
further operations on $\Lambda$ will either destroy
useful information or imply assumptions
concerning the form of $P_0(A,\tau,\omega)$. It is in
this sense at least that a receiver performing
the cross-correlation type operation
(2-2) may be said to be optimum--and the
structure of this receiver does not depend
on $P_0(A,\tau,\omega)$.

2. The way in which $P_0(A,\tau,\omega)$ enters the
computation of $P(A,\tau,\omega)$ is purely as a
scale factor or weighting. Thus its influence
on the equipment design is essentially
that of a gain control adjustment which need
not greatly worry the radar designer since
it can be left to the user to set this control
in accordance with the situation and
his own particular prejudices.


At best it seems to us this argument amounts
to ducking the issue. There are two reasons why
we cannot get rid of the a priori probability problem
so easily:

1. Sooner or later in every system design
it is necessary to make decisions, e.g., to
proceed from noncommittal probabilities to
firm statements, right or wrong, that targets
are located there and there and there. Such
decisions are the rightful concern of the
radar designer, if for no other reason than
that he may very well be called upon to make
them. But as soon as decisions are required
the a priori difficulty comes back to
haunt us--mixed up now with another disturbing
subject, the question of risks or
value judgments. It is idle to pretend that
decisions based on maximum likelihood, Neyman-Pearson,
or minimax criteria avoid the
difficulty since the selection of such a
criterion really amounts in effect to an
assumption about the form of $P_0(A,\tau,\omega)$.

2. The values of $P(A,\tau,\omega)$ for all possible
sets of $A$, $\tau$, and $\omega$ represent quite a lot
of data--more, indeed, than the customer may
wish or can assimilate. Some of these parameters,
e.g., $A$, may not contribute much
information and the customer may very well
suggest that the designer get rid of them
by integrating $P(A,\tau,\omega)$ over these parameters.
In this event the necessity for
knowing or assuming $P_0(A,\tau,\omega)$ is completely
unavoidable.


Throughout the remainder of this paper we 
shall assume that $P_0(A,\tau,\omega)$ is known or has been
more or less arbitrarily selected. This really 
constitutes, of course, a fifth postulate and 
should perhaps be listed in the preceding section. 
In the last analysis the justification for this 
assumption lies in the fact that we are dealing 
with a theory. In the modern axiomatic sense of 
the word a theory cannot be tested on the basis 
of its truth, but only as to its utility. The 
usefulness of the present theory, including the 
assumption of known a priori probabilities, has, 
we believe, been demonstrated many times. 

In certain cases the required assumptions
concerning the a priori distributions are relatively
innocuous. For example, suppose again
that there is known to be only one target present
and that $A^2/N_0$ is known to be reasonably large
compared with 1. Then $\Lambda$ as defined in (2-2)
will, as a result of the narrow-band assumption,
be an almost periodic function of $\tau$ in the vicinity
of the true values of $\tau$ and $\omega$, and,
indeed, will be almost a sine wave of frequency
$\omega_c$ and amplitude $A$. The corresponding
period, expressed in terms of range, is one-half
the wave length at the frequency $\omega_c/2\pi$, i.e., in
most cases a few feet or less. Appealing to a
sort of general principle of continuity, it is
certainly reasonable to assume that $P_0(A,\tau,\omega)$ is
essentially constant over variations of $\tau$ corresponding
to as small a range difference as a
few feet. Thus $P(A,\tau,\omega)$ considered as a function
of $\tau$ can be expected to alternate rapidly between
a large and a small value. Physically the
implication is that the range of the target can
perhaps be determined quite "accurately" (small
fraction of a wave length), but that this
determination is highly "ambiguous" (multiples of
half-wavelengths). As a result of this ambiguity
the fine structure of $P(A,\tau,\omega)$ is essentially
meaningless in most cases¹ and the logical operation
to perform is to replace $P(A,\tau,\omega)$ by local
sums of $P(A,\tau,\omega)$ over half-wavelength intervals,
assuming ranges within this interval equally
likely. Taking advantage of the narrow-band assumption,
this operation is readily carried out
by treating the $\omega_c \tau$ term in (2-2) as an
independent random phase angle, $\theta$, assumed to
be uniformly distributed, $0 \le \theta < 2\pi$. Averaging
over $\theta$ we obtain

$$ P'(A,\tau,\omega) = k\, P_0(A,\tau,\omega)\, e^{-A^2/2N_0}\, I_0\!\left( \frac{A \Lambda_0}{N_0} \right) \tag{2-3} $$

where

$$ \Lambda_0 = \left| \int_{-\infty}^{\infty} r(t)\, S^*(t-\tau)\, e^{-j(\omega_c+\omega)(t-\tau)}\, dt \right| \tag{2-4} $$

and $I_0$ is the Bessel function of imaginary argument.
$P'(A,\tau,\omega)$ can be thought of as essentially
the "envelope" of $P(A,\tau,\omega)$ given by
(2-1). More precisely, $I_0(x)$ is a monotonic
function of $x$ so that $\Lambda_0$ plays the same role
of a sufficient statistic with respect to $P'(A,\tau,\omega)$
as $\Lambda$ plays with respect to $P(A,\tau,\omega)$. Now
it is easy to show that $\Lambda_0$ considered as a
function of $\tau$ may be interpreted physically²
(in the case at least when $\psi(t) = 0$) as precisely
the envelope of $\Lambda$, which is certainly an intuitively
satisfying result considering the
assumptions.
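
The averaging step that turns the exponential of (2-1) into the Bessel function of (2-3) can be checked numerically. The sketch below is an illustration only: it uses the standard identity that the mean of $\exp(x \cos\theta)$ over a uniformly distributed $\theta$ equals $I_0(x)$, with `x` standing for $A\Lambda_0/N_0$; the function name and the test value are hypothetical.

```python
import numpy as np

def phase_average(x, n_theta=4096):
    """Average of exp(x * cos(theta)) over theta uniform on [0, 2*pi)."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    return float(np.mean(np.exp(x * np.cos(theta))))

x = 3.0                       # stands for A * Lambda_0 / N0
numeric = phase_average(x)    # brute-force average over the unknown phase
bessel = float(np.i0(x))      # modified Bessel function of order zero
```

The two values agree to within numerical precision, which is the content of the step from (2-1) to (2-3) once the phase of the correlation is treated as unknown.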


1. But not in all cases, e.g., not if the radar 
being considered is one station of an inter- 
ferometer system. 


2. A touch of reality can be given to this
"physical" interpretation by considering one of
the ways by which $\Lambda$ and $\Lambda_0$ can be computed
in practice. Let the interval in which $s_\omega(t)$
is non-zero be $0 \le t \le T$. Then $\Lambda$ can be written

$$ \Lambda = \int_{-\infty}^{T+\tau} r(t)\, s_\omega(t-\tau)\, dt $$

which can be interpreted as the output at time
$T+\tau$ of a linear filter whose input is $r(t)$
and whose impulse response, $h(t)$, is given by

$$ h(t) = s_\omega(T-t) $$

This is the matched filter for this waveform.
Clearly the output of this filter as a function
of time is precisely equal to $\Lambda$ for various
values of $\tau$. If $s_\omega(t)$ is a narrow-band
waveform, then $\Lambda(t)$ will also have the appearance
of a narrow-band waveform; $\Lambda_0(t)$ is precisely
the ordinary physical envelope of this
waveform, i.e., can be obtained by following the
matched filter with an envelope detector.
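
The relation between the matched-filter output $\Lambda$ and its envelope $\Lambda_0$ can be sketched numerically. This is an illustrative example under stated assumptions, not the author's procedure: a hypothetical real Gaussian envelope with $\psi(t) = 0$, zero Doppler shift, a noise-free echo, and direct evaluation of the correlation integrals on a grid of trial delays.

```python
import numpy as np

def envelope(t, width=1.0):
    """Hypothetical Gaussian |S(t)|, scaled so that (1/2) * integral |S|^2 dt = 1."""
    dt = t[1] - t[0]
    env = np.exp(-0.5 * (t / width) ** 2)
    return env * np.sqrt(2.0 / (np.sum(env ** 2) * dt))

def complex_correlation(r, t, tau, omega_total, width=1.0):
    """Integral of r(t) S(t - tau) exp(-j * omega_total * (t - tau)) dt, S real."""
    dt = t[1] - t[0]
    ref = envelope(t - tau, width) * np.exp(-1j * omega_total * (t - tau))
    return np.sum(r * ref) * dt

t = np.arange(-8.0, 8.0, 0.002)
omega_c, A1, tau1 = 2 * np.pi * 10.0, 3.0, 1.5
r = A1 * envelope(t - tau1) * np.cos(omega_c * (t - tau1))   # noise-free echo

taus = np.arange(0.0, 3.0, 0.005)
z = np.array([complex_correlation(r, t, tau, omega_c) for tau in taus])
Lam = z.real        # matched-filter output Lambda as a function of tau, eq. (2-2)
Lam0 = np.abs(z)    # envelope-detector output Lambda_0, eq. (2-4)
tau_hat = taus[np.argmax(Lam0)]
```

Here `Lam` oscillates at the carrier rate as the trial delay varies, while `Lam0` is its smooth envelope, peaking near the true delay with a value close to the echo amplitude.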


In the literature $P'(A,\tau,\omega)$ has in general
been called the a posteriori probability in the
"random-phase case." This is an unfortunate
name since it has led to considerable confusion
with the already rather confused question of
"coherent" vs. "incoherent" radar. We have gone
through the argument leading to $P'(A,\tau,\omega)$ in some
detail in the hope of pointing out that there is
really no connection between these two ideas.
The receiver computing $\Lambda_0$ is every bit as coherent
as that computing $\Lambda$ in the sense that
complete knowledge of the internal phase structure
of the expected received signal is assumed in each
case--it is only the initial phase or detailed
local range which is assumed random and equivocal
in $\Lambda_0$. And, of course, in neither case is there
any necessity that the transmitted signal have
some regular predictable phase structure ("coherence
pulse-to-pulse")--it is merely necessary
in each case that the receiver be aware a posteriori
of what was indeed transmitted, and this is
a condition which, at least in principle, can
always be satisfied in radar.

3.0 Simple Detection Situations [2]


In the preceding sections we have discussed
the radar model which we have selected and the
role which the a posteriori probability plays as
a complete measure of our after-the-fact knowledge.
But, although the a posteriori probability, supplemented
perhaps with some decision method, represents
a more or less complete solution to the
analysis problem, we must go further before we can
do radar synthesis. In particular we must consider
the quality of our a posteriori knowledge and
the way it depends on the various system parameters.
There are many sorts of quality judgments which
might be applied. We propose to consider just
four:


1. The reliability of the detection or 
determination that a target echo is 
"there," 

2. The accuracy with which the parameters 
of the target echo can be measured, 

3. The possibility of ambiguities in the 
determination of target echo parameters, 

4. The extent to which two target echoes
present simultaneously or overlapping 
can be resolved and measured separately. 


Of these four, the first--reliability of detection
--clearly underlies or precedes the other three.
In order to acquire some feeling for the detection
problem we shall first consider several situations
in which the a priori knowledge is, by assumption,
such that the other three quality judgments do
not apply.


As we have mentioned before, the question of
detection or decision brings up the problem of
value judgments, i.e., the relative "costs" to be
assigned to the various ways and degrees of being
wrong. Fortunately there are several simple situations
from which it is possible to draw remarkably
general conclusions of great power and utility
without having to get deeply involved in such
a slippery subject as value judgments and decision
criteria. The simplest of these is philosophically
almost trivial and might be called the canonic
detection problem.¹ We assume that it is
known a priori that only one of two possible target
situations can ever occur--either there is no
target present at all so that the received signal
consists of noise alone, or else one particular
known target is present so that the received signal
consists of a known echo signal (i.e., known
waveform, $A$, $\tau$, and $\omega$) plus noise. In this
case there are only two a posteriori probabilities
of interest--$P'(A,\tau,\omega)$ and $P'(0) = 1 - P'(A,\tau,\omega)$.
Since our complete a posteriori knowledge of the
situation is thus specified by a single number,
$P'(A,\tau,\omega)$, it is clear that the only rational
decision process is a comparison of $P'(A,\tau,\omega)$ with
a threshold--announcing desired signal present if
$P'(A,\tau,\omega)$ exceeds this threshold, and otherwise
absent. Moreover, since $I_0(x)$ is monotonic in
$x$, a completely equivalent process, and one
which is perhaps more easily interpreted, is merely
a comparison of $\Lambda_0$ with a different threshold, $\beta$.
The choice of $\beta$, of course, depends on the specified
values of $A$, $P_0(A,\tau,\omega)$, and on the appropriate
value judgments selected to rate the
performance of the decision process. But this is
the only way in which the questions of either the
value judgments or the a priori probability enter
the problem. Thus the significance of the a posteriori
approach,² in this case at least, is that
we can state quite unequivocally that the form of
the optimum decision process, i.e., compare $\Lambda_0$
with a threshold, will not depend on the particular
value judgment chosen, which is really a quite
remarkable and important conclusion. Indeed if at
least some relative degree of invariance to such
an emotional quantity as value judgments were not
obtained in as simple a decision problem as this,
we would have very serious doubts about the likelihood
of any really general and useful conclusions
coming out of the present approach.

The remaining result of interest in the
canonic detection problem is a determination of
those attributes of the received echo waveform
which influence the decision performance. First,
it is necessary to point out that the performance
of the detector in this simple problem can be completely
characterized by two conditional probabilities:

$P_d$ = probability of announcing echo signal
present if there actually is such a signal
present = Probability of Detection;

$P_f$ = probability of announcing echo signal
present if there actually is not a signal
present = Probability of False Alarm.


1. Clearly such a situation is almost too trivial
to ever be representative of an actual radar problem.
Nevertheless, certain communications problems,
e.g., synchronous PCM, are represented
rather accurately by this model.


2. Throughout this paper we shall assume that we
are dealing with the "random-phase case" so that
$P'(A,\tau,\omega)$ rather than $P(A,\tau,\omega)$ is the appropriate
probability distribution.


It is easy to show by an analysis of the statistical
properties of $\Lambda_0$ that $P_d$ and $P_f$ are functions
of just two parameters:

$$ R = \frac{A^2}{N_0} = \frac{\text{echo signal energy}}{\text{effective input noise power/cps}} $$

$$ \beta^2 = \frac{(\text{threshold voltage})^2}{\text{effective input noise power/cps}} $$

The parameter $\beta$ can be eliminated and $P_d$ plotted
as a function of $P_f$ with $R$ as a parameter.
The resulting family of curves (Fig. 1) has been
called the receiver detection characteristic.
In accordance with the argument of the preceding
paragraph the interpretation to be put on these
curves is the following. For a given $R$ any pair
of values of $P_d$ and $P_f$ lying on the corresponding
curve can be obtained¹ by comparing $\Lambda_0$ with an appropriate
threshold. The best operating point is,
of course, a function of the selected value judgment
and, in general, $P_0(A,\tau,\omega)$. But in any case
the performance so obtained is optimum in the
sense that no pair of values of $P_d$ and $P_f$ above or
to the left of this curve can be obtained by any
means with the given $R$. Any actual receiver
which fails to compute $\Lambda_0$ or its equivalent will
yield operating points lying below this curve.
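
One point on the receiver detection characteristic can be estimated by simulation. The following is a hedged sketch, not the computation used for Fig. 1: it assumes that $\Lambda_0/\sqrt{N_0}$ is the magnitude of $\sqrt{R}$ plus complex Gaussian noise with unit variance in each component, in which case the noise-only envelope is Rayleigh and $P_f = e^{-\beta^2/2}$. The function names and the Monte Carlo approach are illustrative choices.

```python
import numpy as np

def false_alarm_probability(beta):
    """P_f for noise alone: the envelope is Rayleigh, so P_f = exp(-beta^2 / 2)."""
    return float(np.exp(-0.5 * beta ** 2))

def detection_probability_mc(R, beta, trials=200_000, seed=0):
    """Monte Carlo estimate of P_d for a known amplitude with R = A^2 / N0."""
    rng = np.random.default_rng(seed)
    # The noise is circularly symmetric, so the unknown phase of the echo
    # can be set to zero without changing the distribution of the envelope.
    noise = rng.normal(size=trials) + 1j * rng.normal(size=trials)
    lam0 = np.abs(np.sqrt(R) + noise)       # Lambda_0 / sqrt(N0)
    return float(np.mean(lam0 > beta))

beta = 3.0
pf = false_alarm_probability(beta)
pd_weak = detection_probability_mc(R=4.0, beta=beta)
pd_strong = detection_probability_mc(R=16.0, beta=beta)
pf_mc = detection_probability_mc(R=0.0, beta=2.0)   # check of the Rayleigh formula
```

Sweeping `beta` for a fixed `R` traces out one curve of the family; the simulation depends on the waveform only through `R`, in line with the argument of the text.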


[Plot of $P_d$ = Probability of Detection against $P_f$ = Probability of False Alarm]

Fig. 1: Receiver Detection Characteristics


1. It is important to point out, however, in view
of the many arguments which have arisen over the
past few years, that a receiver which can be considered
as only in the grossest sense computing $\Lambda_0$
may often (indeed one is tempted to say almost always)
have an operating point only slightly below
(equivalent to a few db change in $R$) the optimum.
In other words the optimum represented by comparing
$\Lambda_0$ with a threshold is, in an operational
sense, very broad. This is another example of the
relative invariance which we consider such a satisfying
feature of the theory.


We have found the curves of Fig. 1 very useful
for computing the performance of radars and
other systems (as compared, for example, with the
rather nebulous empirical rule that "reliable"
detection requires some signal-to-noise ratio to
be greater than 1).¹ But for our present purposes
the most important conclusion of the last paragraph
is that the detector performance, in the
canonic detection situation at least and in so far
as it depends on the actual signal received, depends
only on the ratio of the desired echo signal
energy to the noise power per cycle per second, and
not upon any other attribute of the waveform (e.g.,
bandwidth, waveshape, etc.). That the performance
should in principle depend on the ratio, $R$, and not
alone on the more common signal-to-noise power
ratio is certainly not surprising in view of the
Equipartition Law of physics. But when translated
into other terms, e.g., the observation that a
pulse radar and a CW radar will have the same detection
performance on a given target for the
same average received power and observation time,
despite the large difference in bandwidth, the
argument is not always so readily believed.

Before going on to consider more complicated
decision problems it is worthwhile to investigate
the effect of an unknown amplitude,² $A$, on what
is otherwise the canonic decision problem. There
are, perhaps, two extreme situations--one in which
it is desired to estimate $A$ as well as detect the
presence of the signal, and the other in which the
actual value of $A$ contributes essentially no information
and only a yes-or-no answer about presence
is desired. The first situation brings up
the question of accuracy; indeed, our discussion
here will serve as the prototype for later discussions
with respect to $\tau$ and $\omega$. The second
situation provides another example of the proper
way to handle a "stray" or non-information-carrying
parameter. In the second case, in particular,
it is necessary to make an assumption about the
a priori distribution of amplitudes. For purposes
of illustration we shall choose the Rayleigh distribution
which in many cases is a reasonable approximation
to the actual distribution and has the
advantage of being easy to manipulate. That is we
shall assume that

$$ P_0(A,\tau,\omega) = P_0(\tau,\omega)\, \frac{A}{A_0^2}\, e^{-A^2/2A_0^2} \tag{3.1-1} $$

If only a yes-no answer about the presence of
a target is desired, then, paralleling the random-phase
argument, what we must compute is

$$ P'(\tau,\omega) = \int_0^{\infty} P'(A,\tau,\omega)\, dA \tag{3.1-2} $$


1. Indeed we consider the most meaningful and
fundamental form of the radar equation to be that
which relates $R$, rather than received signal power,
to transmitted power, antenna gain, range, etc.
2. It should be remembered that we are still assuming
that, whatever the value of $A$, it remains
constant for the duration of the signal. In physical
terms we are thus modeling the slowly-scintillating
target only. Rapidly-scintillating targets
present much more difficult problems.


i.e., the sum of the a posteriori probabilities
for each value of $A$. The integration can be easily
carried out and it can be shown that a comparison
of $P'(\tau,\omega)$ with a threshold is completely
equivalent to comparing $\Lambda_0$ with a threshold as before.
The difference from the case of known amplitude
comes in the values of $P_d$ and $P_f$. The receiver
detection characteristic is readily computed
and the resulting family of curves is shown
dotted in Fig. 1. The parameter in this case is
the ratio $R_0 = A_0^2/N_0$, where $A_0$ is the most probable
target echo amplitude. The most significant
attributes of these curves are the much lower
value of $P_d$ resulting for high values of $R_0$ compared
with the non-fluctuating case for the same
value of $R$, and the effective saturation in $P_d$
accompanying an attempt to increase $P_d$ by increasing
$R_0$. These results have an important influence
on radar design, but further discussion of these
effects is outside the scope of this paper.
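
The dotted (fluctuating-target) characteristic can be sketched in the same way. The closed form used below is not quoted from the paper; it is a standard consequence of the stated assumptions (Rayleigh amplitude with most probable value $A_0$, uniformly random phase, Gaussian noise), under which the envelope is again Rayleigh and $P_d = P_f^{1/(1+R_0)}$. Function names and the comparison by simulation are illustrative.

```python
import numpy as np

def pd_rayleigh_closed_form(R0, pf):
    """P_d for a Rayleigh-distributed amplitude, with R0 = A0^2 / N0 (assumed model)."""
    return float(pf ** (1.0 / (1.0 + R0)))

def pd_rayleigh_mc(R0, beta, trials=200_000, seed=0):
    """Monte Carlo check: Rayleigh amplitude, uniform phase, unit-variance noise."""
    rng = np.random.default_rng(seed)
    A = rng.rayleigh(scale=np.sqrt(R0), size=trials)        # A / sqrt(N0)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=trials)
    noise = rng.normal(size=trials) + 1j * rng.normal(size=trials)
    lam0 = np.abs(A * np.exp(1j * theta) + noise)           # Lambda_0 / sqrt(N0)
    return float(np.mean(lam0 > beta))

R0, beta = 10.0, 3.0
pf = float(np.exp(-0.5 * beta ** 2))
pd_formula = pd_rayleigh_closed_form(R0, pf)
pd_sim = pd_rayleigh_mc(R0, beta)
```

Because the exponent $1/(1+R_0)$ approaches zero only slowly, large increases in $R_0$ buy small increases in $P_d$ at a fixed $P_f$, which is the saturation effect noted in the text.
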
Suppose, however, it is necessary to determine¹
the actual amplitude, say $A_1$, of the target.
The question then arises as to the accuracy with
which $A_1$ can be measured. The a posteriori probability
of a particular value, $A$, is, from (2-3),

$$ P'(A,\tau,\omega) = k\, P_0(A,\tau,\omega)\, e^{-A^2/2N_0}\, I_0\!\left( \frac{A \Lambda_0}{N_0} \right) \tag{3.1-3} $$

Now as we have already shown the ratio $A_1^2/N_0$ must
be large if the signal is to be detected reliably.
Or alternately we shall show that $A_1^2/N_0$
must be large if $A_1$ is to be determined accurately.
From either point of view we conclude that the
interesting range of $A$ in (3.1-3) is the neighborhood
of $A_1$ and that $A^2/N_0$ and $A\Lambda_0/N_0$ will both be
$\gg 1$. If $P_0(A,\tau,\omega)$ is continuous, we are probably
entitled to assume that the variation of $P_0(A,\tau,\omega)$
in the neighborhood of $A_1$ will be small so that the
a priori probability may be effectively included
for present purposes in the constant $k$. $P'(A,\tau,\omega)$
therefore has a large peak near $A_1$; the precise
location of the maximum, i.e., the most probable
value of $A$, is determined by

$$ \frac{\partial P'(A,\tau,\omega)}{\partial A} = 0 \tag{3.1-4} $$

or

$$ A\, I_0\!\left( \frac{A \Lambda_0}{N_0} \right) - \Lambda_0\, I_1\!\left( \frac{A \Lambda_0}{N_0} \right) = 0 \tag{3.1-5} $$

Using the fact that $A\Lambda_0/N_0 \gg 1$ and preserving only
the first terms of the asymptotic expansions


1. Situations in which the amplitude of the return
is a useful piece of information appear to be
rare in radar systems. The commonest example, perhaps,
is in monopulse systems, and here it is not
so much the amplitude of the echo signal for a
single radar as it is effectively the ratio of the
amplitudes for two radars which matters. It seems
possible that future radar systems may perhaps
find amplitude information useful as an aid to
identification.


$$ I_0(x) \sim \frac{e^x}{\sqrt{2\pi x}} \left( 1 + \frac{1}{8x} + \cdots \right) \tag{3.1-6} $$

$$ I_1(x) \sim \frac{e^x}{\sqrt{2\pi x}} \left( 1 - \frac{3}{8x} - \cdots \right) \tag{3.1-7} $$

the solution of (3.1-5) is simply

$$ A' = \Lambda_0 \tag{3.1-8} $$

with an error of the order of $N_0/\Lambda_0$. Equation
(3.1-8) is certainly reasonable. In particular
we note that as $N_0 \to 0$, $\Lambda_0 \to A_1$.

We now wish to focus attention on a series
of cases in which the actual amplitude is $A_1$ and
to consider the distribution of values of $A'$ which
would result from (3.1-8). A study of the form of
$\Lambda_0$ (see (2-4) and footnote 1, page 2) shows that
$\Lambda_0$ has the same distribution as the envelope of a
sinusoid of amplitude $A_1$ plus a narrow-band Gaussian
noise with variance $N_0$. This problem has been
considered by Rice [4], who has shown that the distribution
for $\Lambda_0$ is approximately normal (for $A_1^2/N_0 \gg 1$)
with mean value equal to $A_1$, as we should expect,
and variance equal to $N_0$. Thus the normalized
effective standard deviation is approximately

$$ \frac{\sigma_A}{A_1} = \frac{1}{\sqrt{R}} \tag{3.1-9} $$

which is the result we sought. Typically $R$ is
the order of 100 or more so that a relative accuracy
of better than 10 per cent in the determination
of $A_1$ is reasonable.
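
The accuracy result (3.1-9) can be checked by simulating the envelope of a sinusoid in narrow-band Gaussian noise. This is a minimal sketch under the assumption that $\Lambda_0$ is the magnitude of $A_1$ plus complex Gaussian noise with variance $N_0$ in each component; the names and numerical values are illustrative only.

```python
import numpy as np

def envelope_samples(A1, N0, trials=200_000, seed=0):
    """Samples of Lambda_0 = |A1 + noise| with noise variance N0 per component."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(N0)
    noise = sigma * (rng.normal(size=trials) + 1j * rng.normal(size=trials))
    return np.abs(A1 + noise)

A1, N0 = 10.0, 1.0                       # R = A1^2 / N0 = 100
lam0 = envelope_samples(A1, N0)
relative_std = lam0.std() / A1           # simulated sigma_A / A1
predicted = 1.0 / np.sqrt(A1 ** 2 / N0)  # 1 / sqrt(R), eq. (3.1-9)
```

With $R = 100$ the simulated relative spread comes out close to the predicted 10 per cent, and the sample mean of $\Lambda_0$ lies close to $A_1$.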

Completely aside from the various approximations
employed, the procedure of the previous
paragraph can be questioned on philosophical
grounds. In outline what we did was the following:

1. Determined that value of $A$, say $A'$, for
which $P'(A,\tau,\omega)$ had its maximum in the vicinity
of $A = A_1$;
2. Found the probability distribution of $A'$
over all the received signals for which $A = A_1$.

The difficulty with this approach is that we
have specified the form of the estimation operation
in advance, i.e., choose that value of $A$ which
maximizes $P'(A,\tau,\omega)$. This is certainly a reasonable
thing to do, but there are other possible
reasonable operations, e.g., choose that value
which will minimize $(A' - A_1)^2$ on the average. We
cannot be sure a priori that the values of $A$ selected
on the basis of different criteria might
not be different and have different error distributions.

Fortunately there is another approach which
gets around this difficulty and, in addition, is
both easier to carry through and more illuminating.
The difference, essentially, is that instead of
computing the distribution of some arbitrarily
selected quantity, such as the location $A'$ of
$\max_A [P'(A,\tau,\omega)]$, over
all received signals with fixed $A_1$, we shall compute
the distribution of values of $A$ which might
have led to a particular $r(t)$. But this is precisely
the a posteriori probability $P'(A,\tau,\omega)$ for
this particular $r(t)$. Now, of course, $P'(A,\tau,\omega)$
can tell us something general about the accuracy
of estimating $A_1$ only if $P'(A,\tau,\omega)$, in the vicinity
of $A = A_1$ at least, has essentially the same
shape for nearly all received signals, $r(t)$, having
the same $A_1$. But this basically implies that
we can measure $A_1$ with high accuracy, which is exactly
the condition we wish to achieve and are
most interested in. Hence, the argument is self-consistent
--if $P'(A,\tau,\omega)$ has a large spike in the
neighborhood of some particular $A$, i.e., if it
is highly probable that a signal with $A$ in this
neighborhood is present, then the shape of $P'(A,\tau,\omega)$
in this neighborhood describes the accuracy with
which we can determine $A_1$ since $P'(A,\tau,\omega)$ is exactly
the probability that some $A$ other than $A_1$
could have been present in $r(t)$. In such a case
$P'(A,\tau,\omega)$ will be determined almost entirely by
the echo signal and will be almost independent of
the particular noise present so that we may make
general statements. If $P'(A,\tau,\omega)$ does not have a
large isolated spike then, although $P'(A,\tau,\omega)$ still
measures the distribution of possible values of $A$
that might have produced the particular $r(t)$ received,
the accuracy is presumably low and no
general statements can be made since the noise has
had a major effect on $P'(A,\tau,\omega)$. This same argument
applies to any other parameter as well as to
$A$ and is the one we shall employ for $\tau$ and $\omega$.

For the particular problem being considered $P(A,\tau,\omega)$ is given by (3.1-3). As before we shall assume that the a priori probability density is essentially constant over the interesting neighborhood so that

$$P(A,\tau,\omega) \sim e^{-A^2/2N_0}\, I_0\!\left(\frac{A\Lambda}{N_0}\right)$$  (3.1-10)


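Now the function $\Lambda$ can be broken up into two terms by writing $r(t) = n(t) + A_0 s_\omega(t-\tau)$:

$$\Lambda = \left|\, \int_{-\infty}^{\infty} A_0\, |s_\omega(t-\tau)|^2\, dt + \int_{-\infty}^{\infty} n(t)\, s_\omega^*(t-\tau)\, dt \,\right|$$  (3.1-11)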


The first integral is equal to $A_0$ = amplitude of echo signal present. The supposition that $P(A,\tau,\omega)$ has a large spike in the neighborhood being considered implies that the second term is with high probability small and that $A^2/N_0$ is large so that

$$P(A,\tau,\omega) \sim e^{-A^2/2N_0}\, e^{A A_0/N_0}$$  (3.1-12)

$$\sim e^{-(A - A_0)^2/2N_0}$$  (3.1-13)


where we have replaced $I_0(A\Lambda/N_0)$ by the first term of its asymptotic expansion (3.1-6) and completed the square. Thus $P(A,\tau,\omega)$ is approximately normally distributed near $A_0$ with its mean at the maximum of $P(A,\tau,\omega)$, i.e., at $A'$, and standard deviation $\sigma_A = \sqrt{N_0}$ as before. In addition to being simpler than the first approach, this method has, in Woodward's words, "the remarkable feature----that [accuracy] can be studied in the absence of noise" since the effect of our argument was to remove the noise term from $\Lambda$ in so far as computing $P(A,\tau,\omega)$ was concerned.
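The near-normal shape claimed above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes the a posteriori density has the form $e^{-A^2/2N_0} I_0(A\Lambda/N_0)$ with $\Lambda$ replaced by its noise-free value $A_0$, and the function names `scaled_i0` and `posterior_stats` are hypothetical. The Bessel function is evaluated from its integral representation so that only the standard library is needed.

```python
import math

def scaled_i0(x, steps=400):
    """exp(-x) * I0(x), from I0(x) = (1/pi) * integral_0^pi exp(x cos t) dt (midpoint rule)."""
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += math.exp(x * (math.cos(t) - 1.0))
    return total * h / math.pi

def posterior_stats(a0, n0, grid=400):
    """Mean and standard deviation of p(A) ~ exp(-A^2/2N0) * I0(A*a0/N0), for A >= 0.

    Assumption: the noise term in Lambda is dropped, so Lambda = a0.
    """
    sigma = math.sqrt(n0)
    lo, hi = max(0.0, a0 - 8.0 * sigma), a0 + 8.0 * sigma
    da = (hi - lo) / grid
    w0 = w1 = w2 = 0.0
    for k in range(grid):
        a = lo + (k + 0.5) * da
        x = a * a0 / n0
        # exp(-A^2/2N0) * I0(x) equals, up to a constant factor,
        # exp(-(A - a0)^2 / 2N0) * [exp(-x) * I0(x)]  (completing the square).
        p = math.exp(-((a - a0) ** 2) / (2.0 * n0)) * scaled_i0(x)
        w0 += p
        w1 += p * a
        w2 += p * a * a
    mean = w1 / w0
    return mean, math.sqrt(w2 / w0 - mean * mean)

# Example: A0^2/N0 = 50 with N0 = 1, so sqrt(N0) = 1.
mean, std = posterior_stats(math.sqrt(50.0), 1.0)
print(mean, std)
```

For a ratio $A_0^2/N_0$ of 50 the computed standard deviation comes out close to $\sqrt{N_0}$ and the mean close to $A_0$, which is what the asymptotic argument predicts; the small residual differences come from the slowly varying factor of the Bessel function that the first-term expansion ignores.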

In the preceding paragraphs we have considered nearly all the questions which can arise in the canonic detection problem, with the exception of those which more properly belong to the study of values and which determine the actual value of the threshold to be employed. To summarize we have observed that:

1. For the canonic detection problem the optimum form of decision process is a comparison of the envelope of the output of a matched filter or cross-correlator with a threshold. The form of this decision process is independent of either the type of value judgment selected or the a priori probabilities;

2. The reliability of detection, in so far as it depends on the characteristics of the radar and the target, depends only on the ratio of received signal energy to noise power per cycle per second, i.e., on $R = A^2/N_0$;

3. The accuracy with which the parameter $A$ can be determined is measured by the standard deviation of $P(A,\tau,\omega)$. For $R \gg 1$ (and we observed that $R$ has to be much greater than one if the detection performance is to be reliable) the shape of $P(A,\tau,\omega)$ in the neighborhood of the correct value of $A$ is nearly the same for all $r(t)$ and is essentially equal to $P(A,\tau,\omega)$ with $A_0 s_\omega(t-\tau)$ substituted for $r(t)$. Specifically the standard deviation of $A$ is given by

$$\sigma_A \approx \sqrt{N_0}$$
(2] 


3.2. The Case of M Orthogonal Signals 


We now wish to consider another simple detection problem which is somewhat more directly related to the radar problem than that discussed in the preceding section. As before we shall assume that at most one target echo signal is present at any one time or during any one observation interval. We shall further assume that the amplitude, $A$, is known a priori and that the initial phase angle is random and informationless. However, unlike the preceding case we shall assume that the parameters $\tau$ and $\omega$ of the echo signal are not known a priori but instead any one of $M$ signals of the form $A s_{\omega_i}(t-\tau_i)$ may be present with equal probability. We shall further assume that these $M$ signals are mutually orthogonal, i.e., that

$$\int s_{\omega_i}(t-\tau_i)\, s_{\omega_j}^*(t-\tau_j)\, dt = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}$$  (3.2-1)

where the star indicates complex conjugate. Although this set of assumptions is clearly a closer approximation to actual radar problems than the canonic detection problem in that $\tau$ and $\omega$ are treated as unknowns, the assumed discrete nature of the a priori distributions in $\tau$ and $\omega$, and the requirement that (3.2-1) be satisfied, are obviously unrealistic. The next section will be largely devoted to removing these restrictions, but their value for the moment is that again we will be able to say something of general value about the reliability of detection without getting involved in the accuracy, ambiguity, and resolution questions. Indeed what we hope eventually to be able to conclude--and this is one of the principal points of our detection philosophy--is that the radar detection problem breaks down into two essentially independent problems. The first of these is to adjust the radar parameters, particularly those which appear in the radar equation, in order to obtain sufficient received energy under the desired conditions to produce reliable detection of the fact that a signal is present. The important point is that the required signal energy, or better the required value of $R$, can be stated almost independently of the received waveform. The second problem, then, is to adjust the received waveform, by means, of course, of choosing the transmitted waveform, in order to obtain the desired performance in terms of accuracy, ambiguity, and resolution.

For the moment, then, we are interested in the reliability of detection if there are $M$ orthogonal signals which might possibly be present one at a time, instead of just one known signal. From the beginning it is apparent that the present problem is philosophically considerably more involved than the simple canonic problem in that the a posteriori probability distribution (which is still given by (2-3)) now consists of a set of numbers¹ instead of just two. In particular it is no longer possible to circumvent more or less completely the question of value judgments--there are many meaningful and different ways in which the question "Is there a signal present?" can be asked. For our present purposes, however, as outlined in the preceding paragraph, it will perhaps suffice to demonstrate the essential invariance of the system behavior by analysing two examples.

The first of these is perhaps the most obvious formulation of the pure detection problem in the present situation. If we reduce the detection problem to a simple binary decision then the performance can be completely described as before by the various probabilities of being right and wrong, such as $P_D$ and $P_F$. In particular for our first example we seek the optimum receiver detection characteristic relating

$P_D$ = probability of announcing that a (any) signal is present when indeed there is a (some) signal present,

and $P_F$ = probability of announcing that a (any) signal is present when actually only noise is present.

From the discussion in preceding sections it should be obvious that the corresponding optimum decision operation is to compare

$$\sum_{i=1}^{M} P(A,\tau_i,\omega_i)$$

with a threshold. Using the assumption of orthogonality (which assures statistical independence among the terms of the sum) and assuming that $M$ is large enough so that the central limit theorem of statistics may be employed, it is a more or less straightforward problem to show that the receiver detection characteristic has essentially the form of the solid lines of Fig. 1 where the parameter, $R$, is to be interpreted, not as equal to $A^2/N_0$ but rather

$$R \approx \frac{A^2}{N_0} - \ln M$$  (3.2-2)

Actually the approximations are such that this expression for $R$ is slightly pessimistic.

1. The probability of each of the $M$ signals together with the probability, $P(0)$, of noise alone.

The most important conclusion to be derived from this result is that the value of $A^2/N_0$ required for a given performance is only logarithmically dependent on $M$. Indeed we would be justified in stating that to a first approximation the required value of $A^2/N_0$ is independent of $M$. The essential truth of this statement is perhaps best illustrated by an example. For "reliable" detection (e.g., $P_D = 0.99$, $P_F = 10^{-5}$) the $R$ of Fig. 1 typically must be the order of 50. If $M = 20{,}000$ then $A^2/N_0$ must equal 60 which is an increase of just 0.8 db over the value of $A^2/N_0$ (= 50) which would be required for the same $P_D$ and $P_F$ if $M = 1$, i.e., if $\tau$ and $\omega$ were fixed and known a priori.
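The arithmetic of this example is easy to reproduce. The sketch below assumes that (3.2-2) reads $R \approx A^2/N_0 - \ln M$, which is the reading that matches the figures quoted in the text (50, 60 and 0.8 db); the function names are hypothetical.

```python
import math

def required_ratio(r_fig1, m):
    """A^2/N0 needed so that A^2/N0 - ln(M) equals the R read from Fig. 1."""
    return r_fig1 + math.log(m)

def db(power_ratio):
    """Express a power (energy) ratio in decibels."""
    return 10.0 * math.log10(power_ratio)

need = required_ratio(50.0, 20000)   # about 60
penalty_db = db(need / 50.0)         # about 0.8 db
print(need, penalty_db)
```

Even a hundredfold increase in $M$ beyond 20,000 adds only $\ln 100 \approx 4.6$ to the required $A^2/N_0$, which is the sense in which the dependence on $M$ is weak.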

As a second example we consider a decision process of a somewhat different nature. We postulate the existence of a receiver with $M$ output channels, one for each possible echo signal. Each channel has two output states corresponding to "signal present" and "signal absent" and we suppose that the channels are so interlocked that only one channel can indicate "signal present" at a time.

We then seek the optimum receiver detection characteristic relating

$P_D$ = probability that a particular signal will be announced if indeed that particular signal is present,

and $P_F = \sum_{i=1}^{M} P_{F_i}$

where $P_{F_i}$ = probability that the $i$th signal will be announced if indeed the $i$th signal is not present.

1. We assume that all signals are treated alike so that $P_{F_i}$ is the same for each channel.

Although this criterion is obviously quite a bit stiffer than that discussed in the first example, we cannot argue that the resulting detector will be better or worse than the first detector unless the use to which the data is to be put and the corresponding appropriate value judgments are considered. Our purpose, however, is not to compare these criteria but rather to show that even in this second case a large increase in $M$ requires only a small increase in the value of $A^2/N_0$ to keep the reliability of detection constant. It is easy to argue that the optimum detection process in this second case is to compare $\Lambda_i$ separately with a threshold for each signal, announcing that signal present, if any, for which the corresponding $\Lambda_i$ is most in excess of the threshold.¹ As a result of the assumed orthogonality among the signals the solid curves of Fig. 1 may then be interpreted as plotting the relationship between $P_D$ and $P_{F_i}$. Assuming that all signals are treated alike

$$P_F = M P_{F_i}$$  (3.2-3)


To illustrate (using the same example as before)--suppose $P_D = 0.99$, $P_F = 10^{-5}$, $M = 20{,}000$. Then $P_{F_i} = 5 \times 10^{-10}$ and the required value of $A^2/N_0$ is $(8.8)^2 = 77.4$ which is a 1.9 db increase over the value of $A^2/N_0$ required for $\tau$ and $\omega$ known a priori and a 1.1 db increase over the value required in the first example--in neither case a very significant amount considering the size of $M$.
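The numbers in this illustration can be approximated without Fig. 1. The following sketch rests on two simplifying assumptions that are ours and not necessarily those behind the solid curves of Fig. 1: the noise-only envelope is Rayleigh, so the per-channel false alarm probability is $e^{-T^2/2}$ for a threshold $T$ measured in units of $\sqrt{N_0}$, and with signal present the envelope is treated as normal about $A/\sqrt{N_0}$ with unit standard deviation. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def threshold_for_false_alarm(p_false):
    """Normalized threshold T such that exp(-T^2 / 2) = p_false (Rayleigh envelope)."""
    return math.sqrt(-2.0 * math.log(p_false))

def required_ratio(p_detect, p_false):
    """Approximate A^2/N0 giving the requested detection and false alarm probabilities.

    Assumption: with signal present the envelope is roughly normal with mean
    A/sqrt(N0) and unit standard deviation (valid for large A^2/N0).
    """
    t = threshold_for_false_alarm(p_false)
    a = t + NormalDist().inv_cdf(p_detect)
    return a * a

single = required_ratio(0.99, 1e-5)                  # one known signal
per_channel = required_ratio(0.99, 1e-5 / 20000)     # P_Fi = P_F / M
print(single, per_channel, 10.0 * math.log10(per_channel / single))
```

Under these assumptions the single-signal requirement comes out near 50 and the per-channel requirement a little under 80, an increase of about 1.9 db, in reasonable agreement with the values read from Fig. 1.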


4.0 Detection in the Case of Continuous Parameter Distributions; Accuracy and Ambiguity


As soon as we pass from the discrete a priori distribution assumed for $\tau$ and $\omega$ in the preceding sections to more realistic continuous distributions, a whole host of problems arise. These problems are not only of a mathematical or computational nature but also, as we saw in a rather elementary way in the example of section 3.2, involve quite complicated questions of value judgments and problem formulation. Roughly the difficulty is that it is no longer possible to state performance measures in black and white terms; there are various ways or degrees of being wrong and the penalties must be weighted accordingly. Nevertheless, if we have any hopes of evolving a useful theory we must face these problems, even if our conclusions are more qualitative than quantitative.

As before we shall assume that at most one target echo is present during any observation interval and that the initial phase angle of the return is random and informationless. We shall further assume, for simplicity and to be definite, that:

1. The amplitude, $A$, is known and constant independent of $\tau$ and $\omega$;

2. The a priori probability density of $\tau$ and $\omega$ is a constant for all values of $\tau$ and $\omega$ lying inside the rectangle in the $\tau$-$\omega$ plane bounded by the lines $\tau = 0$, $\tau = \Theta$, $\omega = \pm\Omega$. In other words all pairs of values of $\tau$ and $\omega$ satisfying the conditions $0 \le \tau \le \Theta$ and $-\Omega \le \omega \le \Omega$ will be assumed equally likely.


Other, less restrictive, assumptions can be handled, at least qualitatively, but these will serve to give the principal outlines of what can be done. Loosely, we shall be concerned with three questions. The first of these is essentially the same question considered previously, i.e., what is the reliability of detection, where by detection we have in mind essentially the same sort of decision as in the first example of section 3.2. And, indeed, our method of handling this problem will be to replace the actual continuous parameter situation by an approximately equivalent discrete orthogonal problem of the type analysed in section 3.2. The other two questions are new. One is the question of the accuracy with which the parameters $\tau$ and $\omega$ can be measured once it has been ascertained that an echo is indeed present. We have considered the question of accuracy before with respect to the measurement of amplitude in the canonic detection problem. Essentially the same method will be employed here for $\tau$ and $\omega$. The third question concerns the possibility of ambiguity, i.e., are there other values of $\tau$ and $\omega$ significantly removed from the proper values which might conceivably be misconstrued as the right values. Actually there are two ways in which an ambiguity might arise. One possibility is that the noise accompanying a particular echo might look sufficiently like some other echo that the a posteriori probability of this latter signal might be large. This type of ambiguity really has more the character of a false alarm--if the detection is "reliable" then such ambiguities should be rare. On the other hand the structure of the signal may be such that two echoes from different targets may look much alike, e.g., the "second-time-around" target in conventional pulse radar. This is the type of ambiguity we wish to study. The principal objective of our study, of course, is radar synthesis. Hence, with respect to accuracy and ambiguity we shall seek both for those attributes of the radar which influence these problems and for appropriate limit theorems or realizability conditions on the types of accuracy and ambiguity performance which can be obtained. We shall find, of course, that unlike the reliability of detection problem, the important parameters influencing accuracy and ambiguity are those which describe the waveform, and we shall discuss an important performance constraint on the waveform which perhaps deserves the title of the Radar Uncertainty Principle.

1. The similarity of this process to that carried out in the usual range-gated radar--particularly those of the so-called "predetection integration" type--is perhaps worth pointing out.

As a result of the various assumptions which we have made, the a posteriori probability density for the present situation can be written in a somewhat simpler form than (2-3). Specifically,

$$P(\tau,\omega) = k\, I_0\!\left(\frac{A\,\Lambda(\tau,\omega)}{N_0}\right)$$  (4-1)

where we have absorbed into $k$ a multitude of terms including the a priori probability density. Now it should be obvious that if we are going to judge the reliability of detection in the present case on the same basis as the first example of section 3.2, i.e., in terms of

$P_D$ = probability of announcing that a (any) signal is present when indeed there is a (some) signal present;

and $P_F$ = probability of announcing that a (any) signal is present when actually only noise is present;

then by analogy with that example the optimum detection procedure is to compare

$$\int_{0}^{\Theta}\!\!\int_{-\Omega}^{\Omega} I_0\!\left(\frac{A\,\Lambda(\tau,\omega)}{N_0}\right) d\omega\, d\tau$$  (4-2)

with a threshold. We can even perhaps imagine an infinite number of cross-correlators followed by non-linear weighting devices, a summing circuit, and a comparison circuit which would physically carry out this operation. But a real difficulty arises when we try to compute $P_D$ and $P_F$ since the noise outputs of these individual channels will not in general be independent--the corresponding signals will not in general be orthogonal or uncorrelated.

The question of the way in which the various possible echo signals are correlated is a most important one for our study since it not only influences the value of $A^2/N_0$ required for a given reliability of detection¹, but also has a major effect on the accuracy and ambiguity question. To see this let us consider two signals--$s_{\omega_1}(t-\tau_1)$ and $s_{\omega_2}(t-\tau_2)$. Let us then compute both $\Lambda(\tau_1,\omega_1)$ and $\Lambda(\tau_2,\omega_2)$ in the case in which $s_{\omega_1}(t-\tau_1)$ is actually present:

$$\Lambda(\tau_1,\omega_1) = \left|\, A \int |s_{\omega_1}(t-\tau_1)|^2\, dt + \int n(t)\, s_{\omega_1}^*(t-\tau_1)\, dt \,\right|$$  (4-3)

$$\Lambda(\tau_2,\omega_2) = \left|\, A \int s_{\omega_1}(t-\tau_1)\, s_{\omega_2}^*(t-\tau_2)\, dt + \int n(t)\, s_{\omega_2}^*(t-\tau_2)\, dt \,\right|$$  (4-4)

If, as we have argued before, the detection is to be "reliable," then the first term in (4-3) must be much larger than the second² so that

$$\Lambda(\tau_1,\omega_1) \approx A$$  (4-5)

If in addition $s_{\omega_1}(t-\tau_1)$ and $s_{\omega_2}(t-\tau_2)$ are highly correlated, by which we mean that

$$\left|\int s_{\omega_1}(t-\tau_1)\, s_{\omega_2}^*(t-\tau_2)\, dt\right| \approx 1$$  (4-6)

then it will also be true that

$$\Lambda(\tau_2,\omega_2) \approx A$$  (4-7)

Thus

$$\Lambda(\tau_1,\omega_1) \approx \Lambda(\tau_2,\omega_2)$$  (4-8)

and so

$$P(\tau_1,\omega_1) \approx P(\tau_2,\omega_2)$$  (4-9)

1. As we have anticipated and will show, this influence is actually rather small.

2. The mean square values of the real and imaginary parts of the second term in both (4-3) and (4-4) are each equal to $N_0$.


Whether $P(\tau_1,\omega_1)$ will actually be greater than or less than $P(\tau_2,\omega_2)$ will depend on the particular noise waveform present, but will not depend very much on which signal is present. In other words, when $s_{\omega_1}(t-\tau_1)$ is actually present, we cannot really be sure if it is $s_{\omega_1}(t-\tau_1)$ or $s_{\omega_2}(t-\tau_2)$ which is present. Thus if two possible received waveforms are highly correlated in the sense of (4-6) and one of them is present, the determination of which one is present is fundamentally ambiguous and no amount of clever data processing can resolve this ambiguity.

In a similar situation suppose that for some fixed value of $\omega = \omega_0$ and for all values of $\tau$ lying in some interval $\Delta\tau$ centered on $\tau_0$ the corresponding signals $s_{\omega_0}(t-\tau)$ are highly correlated in the sense of (4-6). Then if one of these signals is present we shall not be able to determine which one, i.e., we shall not be able to measure $\tau_0$ with an accuracy greater than the order of $\Delta\tau$. This, of course, is in essence the same argument as we employed in section 3.1 with relation to the accuracy of estimating $A$.

To be sure the accuracy and ambiguity situations depend not only on the degree of correlation of the various signals but also on the ratio $A^2/N_0$--we shall have more to say about this in what follows. But the important point is that the limiting possible performance with respect to accuracy and ambiguity depends not on ingenuity in processing the received signals but rather on the shape of the received signals themselves and in particular on the extent to which the various received signals are correlated. It behooves us, therefore, to study the various possible forms which this correlation can take. Such a study will permit us not only to give more-or-less complete answers to the detectability, accuracy, and ambiguity questions in particular cases but will lead to one of the most important theorems constraining radar performance--the Radar Uncertainty Principle.


4.1 Waveform Examples; Radar Uncertainty Principle 


We are interested in the behavior of the function

$$\psi(\tau',\omega') = \left|\int s_{\omega_1}(t-\tau_1)\, s_{\omega_2}^*(t-\tau_2)\, dt\right| = \left|\int s_0(t)\, s_0^*(t+\tau')\, e^{j\omega' t}\, dt\right|$$  (4.1-1)

where

$$\tau' = \tau_1 - \tau_2, \qquad \omega' = \omega_1 - \omega_2$$

and the second expression has been obtained from the first by an elementary change of variable. Physically $\psi(\tau',\omega')$ can be loosely interpreted as the output in the absence of noise of a cross-correlator corresponding to a particular signal when a second signal with a delay less by $\tau'$ and a frequency shift less by $\omega'$ than that particular signal is present. Alternately $\psi(\tau',\omega')$, considered as a function of $\tau'$ and with $\tau'$ interpreted as time, is, except for a scale factor and ignoring noise, precisely the time waveform corresponding to the envelope of the output of a matched filter for a signal at a particular frequency when a signal at a frequency $\omega'$ less is present. Although $\psi(\tau',\omega')$ has a number of interesting and important mathematical properties, it is perhaps more expedient for our present purposes to proceed to a consideration of several examples.


4.11 Impulse or CW Radar 


We shall assume for our first example that the received waveform is a single pulsed-sinusoid of constant amplitude and duration, $T$. Thus, recalling the normalization of equation (1-2)

$$s_0(t) = \begin{cases} 1/\sqrt{T}, & 0 \le t \le T \\ 0, & \text{elsewhere} \end{cases}$$  (4.11-1)

Physically there are two interesting extreme situations to which this example might correspond:

a. CW Radar--where $T$ is essentially the "time on target" and typically $T \gg \Theta$ but $2\pi/T \ll \Omega$;

b. Impulse Radar--where $T$ is now simply the impulse duration and typically $T \ll \Theta$ but $2\pi/T \gg \Omega$.

$\psi(\tau',\omega')$ is shown in Fig. 2 plotted as amplitude above the $\tau'$-$\omega'$ plane. We note that $\psi(0,0)$ is the highest point in the plane--which is reasonable since $\psi(0,0)$ is proportional to the output of the cross-correlator when the corresponding signal is present. Indeed we could prove the general result that for any waveform $s_0(t)$ of finite duration

$$\psi(0,0) > \psi(\tau',\omega'), \qquad \tau', \omega' \neq 0$$  (4.11-1a)

Philosophically this equation can be interpreted to mean that if the noise is vanishingly small, the parameters of an echo signal can be determined with perfect accuracy and without ambiguity--a physically satisfying conclusion.
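The correlation function for the single pulse can be evaluated directly. This is a minimal sketch under stated assumptions: the function is taken to be $\left|\int s(t)\, s^*(t+\tau')\, e^{j\omega' t}\, dt\right|$ for a unit-energy rectangular envelope $s(t) = 1/\sqrt{T}$ on $0 \le t \le T$, and the names `psi_rect` and `psi_closed_form` are hypothetical. The closed form used for comparison, $(1 - |\tau'|/T)\,|\sin x / x|$ with $x = \omega'(T - |\tau'|)/2$, follows from integrating the exponential over the overlap interval.

```python
import cmath
import math

def psi_rect(tau, omega, T=1.0, steps=1000):
    """Numerical |int s(t) s*(t+tau) exp(j omega t) dt| for s = 1/sqrt(T) on [0, T]."""
    lo = max(0.0, -tau)          # both s(t) and s(t + tau) must be non-zero
    hi = min(T, T - tau)
    if hi <= lo:
        return 0.0
    h = (hi - lo) / steps
    acc = 0j
    for k in range(steps):
        t = lo + (k + 0.5) * h   # midpoint rule
        acc += cmath.exp(1j * omega * t)
    return abs(acc) * h / T

def psi_closed_form(tau, omega, T=1.0):
    """(1 - |tau|/T) * |sin(x)/x| with x = omega * (T - |tau|) / 2."""
    overlap = T - abs(tau)
    if overlap <= 0.0:
        return 0.0
    x = 0.5 * omega * overlap
    sinc = 1.0 if x == 0.0 else math.sin(x) / x
    return abs(overlap / T * sinc)

peak = psi_rect(0.0, 0.0)
others = [psi_rect(0.2 * i, 2.0 * j)
          for i in range(-5, 6) for j in range(-5, 6) if (i, j) != (0, 0)]
print(peak, max(others))
```

On this grid every off-origin value is smaller than the value at the origin, in line with the general result stated above, and the first null along the frequency axis falls at $\omega' = 2\pi/T$.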

An alternative way of representing $\psi(\tau',\omega')$ is shown in Fig. 3 and 4 where we have chosen the $\tau'$ and $\omega'$ scales so as to more clearly illustrate the difference between CW and Impulse Radar respectively. In these two figures, as in most of the remaining plots of $\psi(\tau',\omega')$ in this paper, we have chosen to indicate the magnitude of $\psi(\tau',\omega')$ as the density of shading in the two-dimensional $\tau'$-$\omega'$ plane. Moreover, for simplicity we have somewhat arbitrarily restricted the degrees of shading to just three--black corresponding to highly correlated regions, i.e., $\psi(\tau',\omega') \approx 1$, cross-hatch corresponding to weakly correlated regions, i.e., $\psi(\tau',\omega') > 0$, and unshaded corresponding to uncorrelated regions, i.e., $\psi(\tau',\omega') \approx 0$.

Let us now consider the problem of determining the receiver detection characteristic for the case, say, of the CW radar of Fig. 3. We first note that it is not necessary to provide a channel in our detector for every possible pair of values of $\tau$ and $\omega$ as we assumed earlier. For example, since $\Theta \ll T$ the channel corresponding to $\tau = 0$ and any particular $\omega$ will have an output under all circumstances which is very highly correlated, i.e., nearly identical, with the output of the channels for the same $\omega$ and any $\tau \le \Theta$. Thus we could replace all these channels in the integration (4-2) with just one channel, say $\tau = 0$.

Similarly in frequency, except that channels separated by $2\pi/T$ from a given channel are nearly uncorrelated with the given channel. Thus approximately $\Omega T/\pi$ channels are required in frequency to cover the expected range of targets and the outputs of these channels are nearly orthogonal. Thus finally the detection performance in the case of a CW radar cannot be very different from that of the first example of section 3.2 with $M \approx \Omega T/\pi$. Of course we could easily be off by a factor of 2 or more in this value of an equivalent $M$ but since the required value of $A^2/N_0$ depends on $M$ only logarithmically such an error has negligible significance. Moreover the whole argument is somewhat academic since, as we showed in section 3.2, the increase in $A^2/N_0$ required even for a large value of $M$ is quite small. Of course a precisely dual argument holds in the case of the impulse radar, the required value of $M$ being the order of $\Theta/T$. [4]

Next we shall examine the accuracy with which the parameters $\tau$ and $\omega$ can be determined in the case of a CW radar. Of course, since for the allowed variation in $\tau$ all possible signals are highly correlated, essentially no estimate of $\tau$ can be given unless $A^2/N_0$ is very large, as is physically obvious. Measurements of $\omega$ are more interesting. Using the same argument as in section 3.1 the procedure is to identify the variance in the measurement of $\omega$ with the mean square width of the spike in $P(\tau,\omega)$ computed in the absence of noise. As should be readily apparent this is precisely the same as the mean square width of the spike at the origin of $I_0[(A^2/N_0)\,\psi(0,\omega')]$. Making the usual approximations we obtain

$$\sigma_\omega = \frac{2\sqrt{3}}{T\sqrt{A^2/N_0}}$$  (4.11-2)

as the standard deviation of the error in $\omega$ assuming $A^2/N_0 \gg 1$. If we take $2\pi/T$ as the width in frequency of the central spike in $\psi(0,\omega')$ and if we assume $A^2/N_0 = 50$ then (4.11-2) states that we should be able to determine $\omega$ to about 1/10 of this width. This is somewhat better than we are able to achieve in practice because of the effect of systematic errors which have been ignored. Equation (4.11-2) is a special case of a general result which states that

$$\sigma_\omega = \frac{1}{\alpha\sqrt{A^2/N_0}}$$  (4.11-3)

where $\alpha$ is the root mean square duration about the mean of the signal. An essentially similar result holds in general for time measurements:

$$\sigma_\tau = \frac{1}{\beta\sqrt{A^2/N_0}}$$  (4.11-4)

where $\beta$ (radians/sec) is the root mean square bandwidth about the mean frequency of the signal. Unfortunately the approximations on which this latter formula is based break down for a square pulse. Using a slightly modified procedure one obtains the formula

$$\sigma_\tau = \frac{\sqrt{2}\, T}{A^2/N_0}$$  (4.11-5)

which is probably a trifle optimistic, even in theory.
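The frequency accuracy quoted for the single pulse can be checked with a few lines of arithmetic. This sketch assumes the general result has the form $\sigma_\omega = 1/(\alpha\sqrt{A^2/N_0})$, with $\alpha$ the root mean square duration about the mean; that form is our reading of (4.11-3), and the function names are hypothetical. For a rectangular pulse of duration $T$ the rms duration is $T/(2\sqrt{3})$, which is computed numerically here rather than assumed.

```python
import math

def rms_duration_rect(T, steps=20000):
    """RMS duration about the mean for |s(t)|^2 = 1/T on [0, T] (midpoint sums)."""
    h = T / steps
    times = [(k + 0.5) * h for k in range(steps)]
    mean = sum(times) * h / T
    var = sum((t - mean) ** 2 for t in times) * h / T
    return math.sqrt(var)

def sigma_omega(alpha, ratio):
    """Assumed form of the general result: 1 / (alpha * sqrt(A^2/N0))."""
    return 1.0 / (alpha * math.sqrt(ratio))

T = 1.0
alpha = rms_duration_rect(T)              # close to T / (2 * sqrt(3))
spread = sigma_omega(alpha, 50.0)         # standard deviation in radians/sec
fraction = spread / (2.0 * math.pi / T)   # relative to the 2*pi/T spike width
print(alpha, spread, fraction)
```

With $A^2/N_0 = 50$ the standard deviation is roughly 0.08 of the $2\pi/T$ spike width, i.e., the order of the one-tenth figure quoted in the text.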


4.12 Linear-sweep FM-CW Radar


The possibilities with the preceding example were rather limited. Since the bandwidth for a single pulsed-sinusoid is roughly just the reciprocal of the duration it is generally not possible to find a value of $T$ which will simultaneously give acceptable accuracy in both range and velocity. The obvious strategy, then, is to modulate the signal so that the bandwidth can be made many times larger than the reciprocal of the duration. A simple possibility would seem to be to frequency modulate the signal, e.g., to let

$$s_0(t) = \begin{cases} \dfrac{1}{\sqrt{T}}\, e^{j(\omega_0 t + k t^2/2)}, & 0 \le t \le T \\ 0, & \text{elsewhere} \end{cases}$$  (4.12-1)

Thus the frequency of the pulse increases linearly from $\omega_0$ at $t = 0$ to $\omega_0 + kT = \omega_0 + W$ at $t = T$. $\psi(\tau',\omega')$ is readily computed for this waveform and the result can roughly be represented as in Fig. 5. As we can see our strategy has not been entirely successful. To be sure, we observe that we can make a measurement of $\tau$ with an accuracy the order of $2\pi/W$ (roughly an improvement of $WT/2\pi$ over a pulsed-sinusoid of the same duration) but only if we know the proper value of $\omega$. And conversely we can determine $\omega$ to an accuracy the order of $2\pi/T$ if we know the proper value of $\tau$. But if we know neither $\tau$ nor $\omega$ the best we can determine is a relationship between $\tau$ and $\omega$ of the form $\omega - k\tau$ = constant dependent on signal received.¹ Of course this result is hardly unexpected from a physical point of view since the waveforms resulting from a frequency shift and from an appropriate time delay are very similar. Compared with a pulsed-sinusoid of the same duration, what we had hoped to achieve by frequency modulation was a compression in $\tau'$ of the region in which $\psi(\tau',\omega')$ is large (black area in Fig. 3) from a width $\approx T$ to a width $\approx 2\pi/W$ without a corresponding spread in the $\omega'$ direction. What we did not anticipate perhaps is that the volume of $\psi(\tau',\omega')$, instead of being compressed as we squeezed along the $\tau'$ axis, has leaked out into the first and third quadrants. Alternately what we have achieved is a rotation rather than a compression of the CW characteristic of Fig. 3.
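The rotation described above shows up clearly in a direct computation. This is a minimal sketch under assumptions of our own: the envelope is taken as $s(t) = (1/\sqrt{T})\, e^{jkt^2/2}$ on $0 \le t \le T$ (the carrier term is dropped since it does not affect the magnitude), the correlation function is taken as $\left|\int s(t)\, s^*(t+\tau')\, e^{j\omega' t}\, dt\right|$, and `psi_chirp` is a hypothetical name. With this sign convention the integrand reduces to $e^{j(\omega' - k\tau')t}$ times a constant, so the ridge lies along $\omega' = k\tau'$.

```python
import cmath

def psi_chirp(tau, omega, k, T=1.0, steps=2000):
    """Numerical |int s(t) s*(t+tau) exp(j omega t) dt| for a linear-FM pulse."""
    lo = max(0.0, -tau)
    hi = min(T, T - tau)
    if hi <= lo:
        return 0.0
    h = (hi - lo) / steps
    acc = 0j
    for i in range(steps):
        t = lo + (i + 0.5) * h
        s_now = cmath.exp(1j * 0.5 * k * t * t)              # s(t), up to 1/sqrt(T)
        s_shift = cmath.exp(1j * 0.5 * k * (t + tau) ** 2)   # s(t + tau)
        acc += s_now * s_shift.conjugate() * cmath.exp(1j * omega * t)
    return abs(acc) * h / T

k = 200.0                             # sweep rate; W*T = 200 for T = 1
on_ridge = psi_chirp(0.2, k * 0.2, k)   # on the line omega' = k * tau'
off_ridge = psi_chirp(0.2, 0.0, k)      # same delay offset, no frequency offset
print(on_ridge, off_ridge)
```

On the ridge the value is $1 - |\tau'|/T$, exactly as for the unmodulated pulse along its delay axis, while at the same delay offset with zero frequency offset the correlation has collapsed to a few percent. A delay error is therefore indistinguishable from a matching frequency error, which is the coupling between $\tau$ and $\omega$ noted in the text.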

We should not conclude, however, that FM-CW has little advantage over ordinary CW as a radar waveform. In many practical cases $\Omega$ is small compared with $W$ and velocity information is of little use. In this case FM-CW has essentially the accuracy and freedom from ambiguities of an impulse radar having the same bandwidth, together with the lower peak power for the same energy characteristic of the CW case. Moreover, assuming that we can guarantee that we are looking at one and the same target, it may be possible to make a second measurement with a different $k$ (e.g., $-k$) and thus to determine both $\tau$ and $\omega$ with high accuracy and without ambiguity. Nevertheless, compared with other waveforms to be considered, the price of the FM-CW radar is high in terms of bandwidth for the results obtained. Its principal advantage is simplicity of implementation.


4.13 Coherent Periodic Pulse Radar 


The commonest radar waveform, of course, is the periodic pulsed-sinusoid with or without various minor variations. We shall assume for analysis that $s_0(t)$ is real and has the form indicated in Fig. 6a. By implication we are thus assuming that carrier phase is coherent from pulse to pulse, i.e., that the pulses are merely bursts selected from the same continuous sinewave. A closely related waveform is generated physically by starting an oscillator from noise separately for each pulse so that the carrier phase is random from pulse to pulse. The performance obtained with this second, incoherent, waveform is slightly different in some respects from that to be described, as will be mentioned in a later section.

$\psi(\tau',\omega')$ for the waveform of Fig. 6a is shown in Fig. 6b and 6c. Clearly the accuracy of simultaneous measurements of $\tau$ and $\omega$ is now essentially the best which can be expected for a waveform of this bandwidth and total duration. More exactly, equations (4.11-2) and (4.11-5) may be applied to this case yielding


2ve 
Ts = > VRE (4.13-1) 
r, Vz Ss ‘V2 IT 
or * “Ne WAV. (4.13-2) 


1. Clearly the best way to describe this situation would be in terms of the parameters of an ellipse in the $\tau$-$\omega$ plane.


[Fig. 3: $\psi(\tau',\omega')$ for CW Sinusoid.]

[Fig. 4: $\psi(\tau',\omega')$ for Impulse Radar.]

[Fig. 5: $\psi(\tau',\omega')$ for Linear-Sweep FM-CW.]

[Fig. 6: $\psi(\tau',\omega')$ for Periodic Pulsed Sinusoid.]

[Fig.: $\psi(\tau',\omega')$ for Noise Waveform.]


where $T$, $\delta$ and $W$ have the significance indicated in Fig. 6a and $A^2$ is to be interpreted as the total received energy.

Although the accuracy situation, thus, leaves little to be desired in that essentially the best possible performance is obtained within the allowed bandwidth, apparently a new difficulty has appeared. Assuming that $\Theta > \delta$ and/or $\Omega > 2\pi/\delta$ there will now be ambiguities in the determination of $\tau$ and/or $\omega$. These correspond to the familiar "second-time-around" echoes and "blind velocities." It is true, of course, that the spikes in $\psi(\tau',\omega')$ at multiples of the repetition rate and repetition interval are smaller than the spike at the origin, and hence if the ratio $A^2/N_0$ is large enough, the highest spike in $P(\tau,\omega)$ should with high probability correspond to the correct parameters. Some idea as to how large $A^2/N_0$ must be can be acquired from the following argument. Suppose that there is only one possible alternate pair of values which might be confused with each target. Such a situation might arise, for example, if the a priori information made it possible to discard other alternatives as unlikely
or impossible, e.g.,@<O and *%<nr <4], 
Let $4@/u') be the height of the ambiguous spike. 
We wish, then, to compute the probability that the 
wrong spike in P/&w~)will actually be higher (be- 
cause of noise) than the correct spike. Clearly 
if ¢ (6, Jef then there is complete equivocation; 
both spikes in P(g) will under all conditions be 
identical in height and we can say that the prob- 
ability of error is 0.5. We can compute the re- 
lationship between #4%% and #(Sfes') such that the 
probability of error is less oye" some artitra 
value, say 0.10. The result* is approximately 


$$\psi(\tau', \omega') < 1 - \frac{3.45}{E/N_0} \tag{4.13-3}$$

Thus if $\psi(\tau', \omega') = 0.97$, $E/N_0$ would have to be greater
than about 115 before the probability of error would be less than 10%.
Of course (4.13-3) is a theoretical result and ignores such questions as
distortion and drift. Independent of the value of $E/N_0$, it would be
necessary to go to a great deal of trouble to build practical equipment
capable of distinguishing reliably and over a wide dynamic range a
difference as small as 3%. For most practical purposes spikes in
$\psi(\tau', \omega')$ greater than, say, 0.5, constitute unresolvable
ambiguities.
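
As a quick arithmetic check, the sketch below solves the inequality for the smallest energy-to-noise ratio that keeps the probability of error under 10% for a given ambiguous-spike height. The function name `required_ratio` is illustrative, and the constant 3.45 is simply the value implied by the 0.97 and 115 figures quoted above (0.03 x 115); it should not be read as valid for other error probabilities.

```python
def required_ratio(psi, constant=3.45):
    """Smallest ratio R satisfying psi < 1 - constant / R (10% error case)."""
    if not 0.0 <= psi < 1.0:
        raise ValueError("psi must lie in [0, 1); at psi = 1 no ratio suffices")
    return constant / (1.0 - psi)


# Spike heights of 0.5, 0.9 and 0.97 relative to the spike at the origin.
for psi in (0.5, 0.9, 0.97):
    print(f"psi = {psi:.2f}: ratio must exceed about {required_ratio(psi):.1f}")
```

The required ratio grows without limit as the ambiguous spike approaches unit height, which is the quantitative content of the remark that tall spikes are, for practical purposes, unresolvable.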

Finally, the presence of ambiguities has a significant effect on the
value of $M$ to use in equation (3.2-2) for estimating the reliability
of detection. If both $\tau < \delta$ and $\omega < 2\pi/\delta$, then
it is easy to argue that the appropriate value of $M$ is roughly

$$M \approx \frac{\tau W}{2\pi} \cdot \frac{\omega T}{2\pi} \tag{4.13-4}$$

On the other hand, if $\tau \geq \delta$ and $\omega \geq 2\pi/\delta$,
then not all of the signals given by (4.13-4) will be independent, so
that the appropriate value of $M$ is

$$M \approx \frac{TW}{2\pi} \tag{4.13-5}$$

i.e., equal to the time-bandwidth product for the signal. Considering
the implications of the Sampling Theorem concerning the number of
degrees of freedom in a signal of limited time and frequency duration,
this is an entirely reasonable result.

1. Actually (4.13-3) applies to the case of known initial phase angle,
but the error in $E/N_0$ resulting from applying (4.13-3) to the random
initial phase case is small.
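
For a sense of scale, the time-bandwidth product can be evaluated directly. This is only an illustrative sketch: the function name and the 1 ms, 1 Mc example values are assumptions, not figures from the paper, and the bandwidth is taken in rad/sec so that the count is the duration times the bandwidth divided by 2π.

```python
import math


def independent_signals(duration_s, bandwidth_rad_per_s):
    """Time-bandwidth product: duration times bandwidth (rad/sec) over 2*pi."""
    return duration_s * bandwidth_rad_per_s / (2.0 * math.pi)


# Hypothetical example: a 1 ms echo occupying 1 Mc of bandwidth.
T = 1e-3
W = 2.0 * math.pi * 1e6
M = independent_signals(T, W)
print(f"M is roughly {M:.0f}")
```

This is the same count the Sampling Theorem gives for the number of independent samples needed to specify a signal of that duration and bandwidth.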


4.14 The Radar Uncertainty Principle 


It should at this point be obvious that $\psi(\tau', \omega')$
corresponding to the ideal radar waveform should have the appearance of
Fig. 7--a single narrow spike at the origin and nothing anywhere else
in the plane. For maximum accuracy the spike should have a width of
approximately $2\pi/T$ in frequency and $2\pi/W$ in time, and $T$ and
$W$ should be independently adjustable. The difficulty is--as we might
expect--that such conditions are fundamentally impossible to achieve.
We now wish to study why.

Our efforts thus far to achieve a waveform having a
$\psi(\tau', \omega')$ similar to Fig. 7 have been kind of like
squeezing a pillow--as we push in one direction the pillow bulges out
in the other, and if we are too persistent the casing breaks and we
have piles of feathers all over the landscape. Or perhaps a better
analogy would be to imagine that the $\psi(\tau', \omega')$ contour is
the surface of a pile of sand. As we adjust the waveform we seem to be
able to move the sand around but unable to get rid of any of it. This
latter analogy, with one modification--namely that we should talk about
the $\psi^2(\tau', \omega')$ contour instead of the
$\psi(\tau', \omega')$ contour--is actually a precise statement of the
most important limitation on radar accuracy-ambiguity performance,
i.e., what we shall call the Radar Uncertainty Principle. But before we
give a precise formulation of this principle it is perhaps valuable to
demonstrate it in an approximately quantitative manner for the various
waveforms we have thus far considered. For most radar waveforms
$\psi(\tau', \omega')$ consists roughly of a number of spikes of
approximately unit height together with regions in which
$\psi(\tau', \omega') \approx 0$. An approximate evaluation of the
volume under the $\psi^2(\tau', \omega')$ contour can be achieved by
replacing the spikes with roughly equivalent cylinders of unit height
and ignoring the volume in the regions where
$\psi(\tau', \omega') \approx 0$. For the three waveforms thus far
considered this crude volume computation is as follows:


[Table: for each of the three waveforms, (height) x (area of each
cylinder) x (number of cylinders) = volume. The individual entries are
illegible in the source; the volume in every row is $2\pi$.]


Although it is something of an accident that this rather crude method
works so nicely in these cases, the conclusions nevertheless are
correct. More formally, a precise statement of the Radar Uncertainty
Principle is:

Independent of the form of $\psi(\tau', \omega')$,

$$\frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \psi^2(\tau', \omega') \, d\tau' \, d\omega' = 1 \tag{4.14-1}$$

This result is easily proved by directly carrying out the indicated
integrations after substituting the definition of
$\psi(\tau', \omega')$ from (4.1-1).
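
The constancy of the volume can also be checked numerically. The sketch below is a minimal discrete version, under an assumption that should be stated loudly: definition (4.1-1) is not reproduced in this section, so $\psi$ is taken here to be the magnitude of the integral of $s(t)\,s^*(t+\tau)\,e^{-j\omega t}$ for a unit-energy waveform $s(t)$. The function name, the sample spacing, and the two test waveforms are illustrative only.

```python
import cmath
import math


def psi_squared_volume(samples, dt):
    """Return (1 / 2*pi) times the double sum approximating the integral of psi^2.

    Assumed definition (not taken from the paper): psi(tau, omega) is the
    magnitude of the integral of s(t) * conj(s(t + tau)) * exp(-j*omega*t) dt
    for a waveform s(t) scaled to unit energy.
    """
    n = len(samples)
    energy = sum(abs(x) ** 2 for x in samples) * dt
    s = [complex(x) / math.sqrt(energy) for x in samples]
    d_omega = 2.0 * math.pi / (n * dt)      # frequency step of an n-point DFT
    volume = 0.0
    for k in range(-(n - 1), n):            # delay tau = k * dt
        # s(t) * conj(s(t + tau)); zero where the shifted copy leaves the record
        prod = [s[i] * s[i + k].conjugate() if 0 <= i + k < n else 0j
                for i in range(n)]
        for m in range(n):                  # frequency shift omega = m * d_omega
            chi = dt * sum(p * cmath.exp(-2j * math.pi * m * i / n)
                           for i, p in enumerate(prod))
            volume += abs(chi) ** 2 * d_omega * dt
    return volume / (2.0 * math.pi)


pulse = [1.0] * 16                                         # envelope of a pulsed sinusoid
sweep = [cmath.exp(1j * 0.05 * i * i) for i in range(16)]  # linear frequency sweep
print(psi_squared_volume(pulse, dt=0.1), psi_squared_volume(sweep, dt=0.1))
```

Both printed values come out at 1 to within rounding error even though the two surfaces look entirely different, which is the "pile of sand" picture in numerical form: changing the waveform moves the volume around without changing its total.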
The Radar Uncertainty Principle has a number of important
applications:

1. As a part of the a posteriori probability approach, the Uncertainty
   Principle helps to emphasize that waveform selection, rather than
   ingenuity in detector design, is the determining element in radar
   accuracy, ambiguity, and (as we shall see) resolution performance.
2. By setting a bound on performance quality, it prevents much
   fruitless searching for waveform and detection methods intended to
   achieve such impossible performance as that described at the
   beginning of this section.
3. The Radar Uncertainty Principle has proved very helpful in finding
   the flaws in various suggested radar waveforms, detection
   procedures, MTI schemes, etc. Specifically, one can be sure that the
   analysis is complete if and only if all of the volume under the
   $\psi^2(\tau', \omega')$ contour has been accounted for.


Unfortunately, although the Radar Uncertainty Principle represents a
necessary condition for the existence of a waveform $s(t)$ having a
given $\psi(\tau', \omega')$, it is not sufficient. A number of
additional conditions can be specified, including sufficient
conditions, but the forms of these conditions are not sufficiently
simple to be really useful in waveform synthesis. There seems to be no
real substitute at this point for an educated but intuitive guess
followed by careful analysis.


4.15 The Ideal Waveform

We shall conclude this section with a discussion of several waveforms
which come as close as possible to the ideal radar waveform--at least
from the standpoint of accuracy and freedom from ambiguities. These
waveforms have essentially the $\psi(\tau', \omega')$ plot of Fig. 8,
i.e., a single central spike of width $2\pi/T$ in frequency and
$2\pi/W$ in time (where $T$ and $W$ are the echo duration and bandwidth
in rad/sec respectively) with the remainder of the necessary volume (if
$TW$ is large this will be nearly all the volume) spread out
more-or-less uniformly over a region roughly $T$ wide in time and $W$
wide in frequency.

All of the waveforms corresponding to Fig. 8 have in common that they
are in some sense noisy or pseudo-random--by which we mean that it
takes many numbers to specify them, as opposed to the waveforms we have
already considered, which are specified by just a few numbers, e.g.,
pulse length, repetition rate, etc. For example, if $s(t)$ is a sample
from almost any sort of noise, e.g., hard-limited narrow-band Gaussian
noise of duration $T$ and bandwidth $W/2\pi$, the corresponding
$\psi(\tau', \omega')$ will with high probability look like Fig. 8
provided that $TW/2\pi$ is very large, the order of 10^[...] to
10^[...] or more. But when $TW/2\pi$ is only the order of 10 to $10^2$,
the noise waveform has to be selected with some care if spikes of
height 0.5 or more at values of $\tau'$ and $\omega'$ other than the
origin are to be avoided.
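
The dependence on the time-bandwidth product can be illustrated with a small experiment. The sketch below makes two simplifying assumptions that are not the paper's: a random sequence of +1/-1 chips stands in for a sample of hard-limited noise, and only the delay axis (zero frequency shift) is examined. The function name and the seed are arbitrary.

```python
import random


def delay_axis_sidelobe(chips):
    """Largest correlation magnitude at nonzero delay, relative to the peak."""
    n = len(chips)
    worst = 0
    for k in range(1, n):
        corr = sum(chips[i] * chips[i + k] for i in range(n - k))
        worst = max(worst, abs(corr))
    return worst / n


rng = random.Random(1956)  # fixed seed so that a run is repeatable
for n in (16, 128, 1024):
    chips = [1 if rng.random() < 0.5 else -1 for _ in range(n)]
    print(n, round(delay_axis_sidelobe(chips), 3))
```

For long sequences the largest subsidiary spike is a small fraction of the central one almost regardless of which sample is drawn, while short sequences vary a good deal from sample to sample and so have to be picked by hand.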

An interesting example of a suitable waveform having
$TW/2\pi$ = 10^[...] or less is the coded-pulse waveform. This waveform
is constructed by starting with a pulsed sinusoid of duration $T$. This
pulse is then divided into $TW/2\pi$ intervals, each of duration
$2\pi/W$. Each interval is then preserved as originally, or reversed in
phase by 180°, according to whether the corresponding position in a
binary sequence or code of length $TW/2\pi$ is 0 or 1. Clearly the
performance of such a waveform then depends on the code selected.
Almost any phase-modulated waveform having a bandwidth $W$ or less can
be closely approximated by choosing the proper code. At the moment we
are interested in that code or codes which for a given value of
$TW/2\pi$ will minimize $\psi(\tau', \omega')$ for
$\tau', \omega' \neq 0$. The problem of finding such a code is closely
related to the coding problem in information theory and many of the
methods employed there are applicable. For example, for
$TW/2\pi = 2^n - 1$ the best codes yet found (and there is reason to
believe no better codes exist) are those called by various authors
shift-register or null sequence codes of maximum length. An example of
such a code for $n = 5$, $TW/2\pi = 31$ is the following

0110100001100100111110111000101


which is obtained by starting off with the code of length 5, 01101 (any
other starting code except 00000 will yield equivalent results), and
setting the next element equal to the sum modulo 2 of the first,
second, third, and fifth digits preceding. This process is repeated for
each successive element. It will be found that after 31 elements have
been written down, the sequence will repeat, and indeed, the fact that
the code does not repeat prior to the $2^n - 1$ element is a sufficient
test that a proper parity check rule has been employed. For any value
of $n$ there are a number of such codes--all of which, so far as has
been determined, are equally satisfactory for our present purposes. It
should be recalled that we are interested, loosely, in minimizing the
maximum value of $\psi(\tau', \omega')$ for all values of $\tau'$ and
$\omega'$ excluding the origin. In particular it would not be
sufficient alone to minimize $\psi(\tau', \omega')$ along the $\tau'$
axis, i.e., for $\omega' = 0$. Codes can be found which have better
performance along the $\tau'$ axis than the maximum length null
sequences, but such codes inevitably seem to have large spikes, i.e.,
potential ambiguities, off the $\tau'$ axis. Experience would seem to
suggest that maximum-length null sequence codes yield a maximum value
of $\psi(\tau', \omega')$ (excluding the origin) the order of
$\sqrt{2\pi/TW}$.
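
The construction just described is short enough to sketch in code. The generator below follows the stated rule (starting code 01101; each new digit is the sum modulo 2 of the first, second, third, and fifth digits preceding). The second function is an assumption rather than the paper's definition: it evaluates a discrete stand-in for the surface at chip-aligned delays only, with the frequency shift expressed in radians per chip and treated as constant within a chip. All names are illustrative.

```python
import cmath
import math


def shift_register_code(seed, taps, length):
    """Extend seed; each new bit is the sum mod 2 of the bits `taps` places back."""
    bits = list(seed)
    while len(bits) < length:
        bits.append(sum(bits[-t] for t in taps) % 2)
    return bits


def psi(chips, k, omega):
    """|sum_i c[i] c[i+k] exp(-j*omega*i)| / N, an assumed discrete form of psi."""
    n = len(chips)
    total = sum(chips[i] * chips[i + k] * cmath.exp(-1j * omega * i)
                for i in range(max(0, -k), min(n, n - k)))
    return abs(total) / n


bits = shift_register_code([0, 1, 1, 0, 1], taps=(1, 2, 3, 5), length=62)
code = "".join(str(b) for b in bits[:31])
chips = [1 if b == 0 else -1 for b in bits[:31]]  # 0 keeps the phase, 1 reverses it

n = len(chips)
grid = 4 * n                                      # frequency steps of 2*pi/(4N) per chip
worst = 0.0
for k in range(-(n - 1), n):
    for m in range(-grid // 2, grid // 2 + 1):
        if k == 0 and abs(m) < 4:                 # skip the central spike itself
            continue
        worst = max(worst, psi(chips, k, 2.0 * math.pi * m / grid))

print(code)
print("repeats after 31 digits:", bits[31:] == bits[:31])
print("largest subsidiary value:", round(worst, 3), "vs 1/sqrt(31) =", round(1 / math.sqrt(31), 3))
```

The first line printed is 0110100001100100111110111000101, and the second confirms that the sequence repeats after 31 digits and not before. The last line lets the largest value found away from the central spike be compared with the order-of-magnitude estimate; the comparison is only as good as the chip-aligned grid it is computed on.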

We are justified in concluding that--from the standpoint of accuracy
and ambiguity--coded-pulse or other noise-like waveforms achieve
essentially the ultimate possible performance. Why then have such
waveforms not had a wider application in radar? To be sure, an
appreciation of the value of such waveforms is relatively recent and
there is a feeling, which we do not completely share, that the
equipment to generate and process such waveforms is impractically
complicated. But the most important reason is that noise waveforms have
in many practical target situations serious disadvantages from the
standpoint of resolution. This is a problem which we now wish to
investigate in general.


5.0 Resolution 


Thus far we have considered only cases in which it was known a priori
that at most one target echo was present at any one time. Such a
fortuitous situation almost never occurs in practice. At the very least
we have to discriminate against our own transmitted signal, which
represents a huge signal at zero range and velocity. Moreover, in many
cases ground clutter, chaff, the ionosphere, meteor trails, etc.,
return echoes which not only contain little useful information but
which, if they have a large amplitude, may obscure the echoes from
desired targets. Indeed, it is probably only a slight exaggeration to
claim that in many cases the problem of resolving desired echoes from
undesired echoes and from one another is so important as to be the
principal requirement on the radar design. Freedom from ambiguities,
for example, might be a nice thing to have, but not if it can be
obtained only with a reduction in resolution performance.

Despite the importance of this problem, there is remarkably little we
can say about it of any general validity or utility. There are, of
course, several obvious platitudes which despite their triviality help
to formulate the problem.

1. If the interfering signal is known completely and exactly (i.e., if
$\tau$, $\omega$, the amplitude, and the carrier phase angle are known
precisely), then there is really no resolution problem, since the
obvious and theoretically correct procedure is to subtract a replica of
the interfering signal from the received signal prior to processing.

2. If the class of undesired signals is 
identical with the class of desired signals then 
there is obviously no possibility for resolution. 

Thus the resolution problem is theoretically and practically an
interesting one only if the exact characteristics of the interfering
signals are unknown in one or more respects and different in one or
more respects from the desired signals. There are, of course, many
kinds of situations, but a large fraction of these are of little
interest. For example, two signals which are identical except for
amplitude obviously can be resolved with high reliability only if the
most probable desired signal energy is much greater than the most
probable undesired signal energy. The most interesting cases are those
in which the signals to be distinguished differ in time delay and/or
frequency shift. Here the appropriate measure of the difference between
signals is $\psi(\tau', \omega')$ and there are two cases of interest:


1. Each desired signal is orthogonal to 
the entire class of undesired signals; 

2. Each desired signal is at most weakly
correlated with some of the members of the
class of undesired signals.


Clearly, resolution is almost impossible if the undesired signals are
large and strongly correlated with the desired signals.

In the first case resolution is trivially simple to obtain. It can
easily be shown that the a posteriori probability that any particular
desired signal is present is entirely independent of the presence or
size of the undesired signals. Hence all of the analysis in preceding
sections with respect to reliability of detection, accuracy, and
ambiguity is immediately applicable. Resolution in this case is
obtained automatically.

On the other hand, in the second case the situation is not nearly so
clear. We might, of course, just pretend that the signals to be
separated are orthogonal and thus build our decision circuits as
previously. The resolution performance under these conditions will be
quite good, i.e., $P_D$ and $P_F$ for the desired signals will be
essentially unaltered by the presence of the undesired signals,
provided that the ratio of the energy of the undesired signal to the
energy of the desired signal remains somewhat less than
$1/\psi^2(\tau', \omega')$. Thus a degree of relative resolution¹ can
be obtained. However, it should be possible to achieve somewhat better
relative resolution by altering the form of the detector, e.g., by
intentionally using an appropriately mis-matched filter. Of course, the
reliability of detection for desired signals will then be less, but in
the presence of correlated undesired signals such a loss in reliability
of detection is inevitable.

In certain hypothetical cases the best possible form for the
mis-matched filter can be worked out. For example, if the class of
undesired signals consists of a finite set of signals at known discrete
values of $\tau$ and $\omega$ but with unknown amplitude, it is
possible to design a detector providing essentially infinite resolution
at only a (usually) small price in detectability. But for the case of
continuous parameter distributions no really satisfactory procedures
are known. However, it seems unlikely that any truly significant
improvements in performance can be achieved by mis-matching the filter.
Empirical methods for designing clutter rejection filters, for example,
are probably as good as any.
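
One way to picture the finite-set case is the sketch below. It is not the detector the author has in mind, which the text does not specify; it substitutes a standard technique, removing from the matched-filter reference its component along a single known interfering signal, so that the interferer produces no output whatever its amplitude. The signal vectors and function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two real sample vectors."""
    return sum(x * y for x, y in zip(a, b))


def mismatched_filter(desired, interferer):
    """Desired-signal replica with its component along the interferer removed."""
    scale = dot(desired, interferer) / dot(interferer, interferer)
    return [d - scale * u for d, u in zip(desired, interferer)]


desired = [1.0, 1.0, 1.0, 1.0]
interferer = [1.0, 1.0, 1.0, -1.0]   # hypothetical undesired signal, correlation 0.5
weights = mismatched_filter(desired, interferer)

leakage = dot(weights, interferer)   # response to the undesired signal
# Output signal-to-noise ratio in white noise, relative to the matched filter.
relative_snr = (dot(weights, desired) ** 2 / dot(weights, weights)) / dot(desired, desired)
print(weights, leakage, relative_snr)
```

With these numbers the filter weights are [0.5, 0.5, 0.5, 1.5], the response to the interferer is exactly zero, and the detectability falls to 0.75 of the matched-filter value, i.e. one minus the square of the correlation. The price is small when the correlation is weak and becomes prohibitive as it approaches unity.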


The conclusion is inescapable that if resolution is important the radar
waveform must be so chosen as to make the signals to be resolved as
nearly orthogonal as possible. From this point of view the periodic
pulse waveform has outstanding advantages, constituting as good a
reason as any for its overwhelming popularity. By permitting
ambiguities, the periodic pulse waveform manages to cram nearly all the
volume required under the $\psi^2(\tau', \omega')$ surface into tall
slender spikes, leaving most of the $\tau'$-$\omega'$ plane absolutely
empty. No other waveform is quite so well suited for those applications
(e.g., GCI radars) in which resolution capability is (or should be) the
pre-eminent design specification. However, when slight compromises can
be tolerated in resolution performance, important advantages in other
respects can be achieved by several variations in the coherent periodic
pulse waveform, e.g., non-coherent phase from pulse-to-pulse, a
periodically-repeated phase- or frequency-modulated pulse, or a
time-duplexed radar employing two repetition rates, each for half the
echo duration. Other schemes, e.g., staggered repetition rate, or
changing frequency from pulse-to-pulse, are less useful since they have
a serious effect on resolution performance.

1. Relative resolution is clearly what we have in mind when we speak,
for example, of sub-clutter visibility.
We do not intend to create the impression that a periodic pulse radar
is an adequate solution to the usual, let alone the extreme, resolution
situation; it is not. There are many radar systems in particular which
fail to achieve the desired performance, if for no other reason,
because either the resolution performance of the periodic pulse radar
is inadequate or because the ambiguities associated with this waveform
are intolerable. We feel rather strongly that within the constraints
imposed by our model, i.e., by our present radar system philosophy, a
satisfactory solution to these problems is essentially impossible. In
particular we believe that it will be necessary to break away from the
idea of a finite (i.e., short) observation interval with its associated
concept of the "occupied cell" in time and frequency. Our desired
targets have in general a time history or life pattern which is a much
more distinguishing characteristic than their instantaneous position
and velocity. We shall have to learn how to exploit this
characteristic. The philosophical, computational, and practical
problems appear, at this time, to be rather difficult, but here, if
anywhere, would seem to lie the future promise of radar.


6.0 Acknowledgements 


Although the author takes full responsibility for the opinions and
results presented in this paper, little if any of the paper can truly
be said to be original. In particular the now classic work of P. M.
Woodward and I. L. Davies underlies the entire paper, both in basic
approach and, in some cases, in detail. In addition the author has
profited immeasurably from numerous papers, reports, and discussions
with many people--more than he can possibly acknowledge or perhaps in
some cases even recall. Specifically the author wishes to acknowledge
the cooperation and contributions of his colleagues at Lincoln
Laboratory and M.I.T., notably Prof. R. M. Fano, R. M. Lerner, L. G.
Kraft, R. Manasse, and F. A. Rodgers, among many others.


7.0 Bibliography


The following list contains only those books and papers referred to in
the text or deemed most appropriate to the matters under discussion.
There are numerous other excellent and important papers on various
aspects of this subject, most of which are listed in the bibliographies
of the references cited.


[1] P. M. Woodward, "Probability and Information Theory, with
    Applications to Radar," McGraw-Hill Book Co., New York, N.Y.,
    1955.

    P. M. Woodward and I. L. Davies, "A Theory of Radar Information,"
    Phil. Mag., vol. 41, pp. 1001-1017, Oct., 1950.

    P. M. Woodward and I. L. Davies, "Information Theory and Inverse
    Probability in Telecommunication," Proc. I.E.E., Pt. III, vol. 99,
    pp. 37-44, Mar., 1952.

[2] W. W. Peterson, T. G. Birdsall, and W. C. Fox, "The Theory of
    Signal Detectability," Trans. PGIT-4, I.R.E., Sept., 1954.

[3] S. O. Rice, "Statistical Properties of a Sine Wave Plus Random
    Noise," B.S.T.J., vol. 27, pp. 109-157, Jan., 1948.

[4] R. Manasse, "Range and Velocity Accuracy from Radar Measurements,"
    unpublished internal report, Lincoln Laboratory, Mass. Inst. of
    Tech., Cambridge, Mass., 1955.

[5] C. W. Helstrom, "The Resolution of Signals in White Gaussian
    Noise," Proc. I.R.E., vol. 43, pp. 1111-1118, Sept., 1955.

[6] D. A. Huffman, "The Synthesis of Linear Sequential Coding
    Networks," Proc. Third London Symposium on Information Theory,
    Sept., 1955.

[7] N. Zierler, "Several Binary-Sequence Generators," Tech. Rep. 95,
    Lincoln Laboratory, Mass. Inst. of Tech., Cambridge, Mass., Sept.,
    1955.




INFORMATION FOR AUTHORS 


Authors are requested to submit editorial correspondence or technical
manuscripts to the Publications Chairman for possible publication in
the PGIT TRANSACTIONS. Papers submitted should include a statement as
to whether the material has been copyrighted, previously published, or
accepted for publication elsewhere.

Papers should be written concisely, keeping to a minimum all
introductory and historical material. It is seldom necessary to
reproduce in their entirety previously published derivations, where a
statement of results, with adequate references, will suffice.

To expedite reviewing procedures, it is requested that authors submit
the original and two legible copies of all written and illustrative
material. The manuscript should be double-spaced, and the illustrations
drawn in india ink on drawing paper or drafting cloth. Each paper
should include a carefully written abstract of not more than 200 words.
Upon acceptance, papers should be prepared for publication in a manner
similar to those intended for the PROCEEDINGS OF THE IRE. Further
instructions may be obtained from the Publications Chairman. Material
not accepted for publication will be returned.

IRE TRANSACTIONS ON INFORMATION THEORY is published four times a year,
in March, June, September, and December. A minimum of one month must be
allowed for review and correction of all accepted manuscripts. A period
of approximately two months additional is required for the mechanical
phases of publication and printing. Therefore, all manuscripts must be
submitted three months prior to the respective publication dates. In
addition, the IRE CONVENTION RECORD is published in July, and a bound
collection of Information Theory papers delivered at the annual IRE
National Convention is mailed gratis to all PGIT members.


All technical manuscripts and editorial correspondence should be addressed to 
Laurin G. Fischer, Federal Telecommunications Lab., 492 River Road, Nutley, N. J. 
Local Chapter activities and announcements, as well as other nontechnical news items, 
should be addressed to Nathan Marchand, Marchand Electronic Labs., Riversville 
Road, Greenwich, Conn. 


